Flat file (fixed-width) schema creation/parsing error in Anypoint Studio

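I am trying to parse a fixed-width flat file (header and detail type records) that has no repeating or defined tag values to identify the segments. When I try to process the file in Anypoint Studio (a simple transformation to JSON), I get the error message "java.lang.IllegalStateException: no segments defined". I know the schema needs to be fixed, but I have run out of ideas to try.

I would appreciate it if someone could point out what is wrong from the Anypoint Studio side.

Schema: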

form: FIXEDWIDTH
structures:
- id: 'flatfile'
  name: flatfile
  tagStart: 0
  data:
  - { idRef: 'Header', count: 1}
  - { idRef: 'Items', count: 99, usage: O}  
segments:
- id: 'Header'
  name: Header
  values:
 - { name: 'PCBCode', type: String, length: 8 }
 - { name: 'NumberTG', type: String, length: 17 }
 - { name: 'TopSort', type: String, length: 1 }
 - { name: 'InternalRef', type: String, length: 5 }
 - { name: 'DateInt', type: String, length: 26 }
 - { name: 'DAT', type: String, length: 26 }
 - { name: 'DIN', type: String, length: 26 }
 - { name: 'DLN', type: String, length: 26 }
 - { name: 'DON', type: String, length: 26 }
 - { name: 'Sort', type: String, length: 10 }
 - { name: 'NameCharter', type: String, length: 35 }
 - { name: 'NumberReg', type: String, length: 17 }
 - { name: 'NatTruck', type: String, length: 3 }
 - { name: 'NumRemarks', type: String, length: 17 }
 - { name: 'NatRemarks', type: String, length: 3 }
 - { name: 'Weight', type: String, length: 6 }
 - { name: 'Remarks', type: String, length: 35 }
- id: 'Items'
  name: Items
  values:
 - { name: 'TVNum', type: String, length: 17 }
 - { name: 'Load', type: String, length: 1 }
 - { name: 'Flag', type: String, length: 1 }
 - { name: 'col', type: String, length: 17 }
The sample data below has a length of 4000:

BCD_VAN 180223G04467     N377612018-02-23-13.57.15.7722282018-02-26-13.21.26.3305841901-01-01-00.00.00.0000001901-01-01-00.00.00.0000001901-01-01-00.00.00.000000          TAURUS                             W1TRS19          PL WWL72142         PL 000000                                   G18GKJ99-690851                     G18GKJ96-690851                     G18GKJ22-685131                     G18GKJ00-668701                     G18GGX99-668701                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                

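Fixed-width data is easy to handle thanks to the magic of Python slices. A slice is an object that can be used to "slice" a piece out of a sequence, whether that sequence is a string, a list, a tuple, or anything else that supports indexing. Suppose you have the string rec = "BLAHXXIMPORTANT DATAXXBLAH". You can extract the important data with rec[6:20]. You can also create a slice with data_slice = slice(6, 20) and then use rec[data_slice] to get the value from rec.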
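As a quick illustration of the paragraph above, here is a minimal sketch that you can paste into an interpreter. It only uses the example string and the names rec and data_slice from the text; nothing in it is specific to your file.

```python
rec = "BLAHXXIMPORTANT DATAXXBLAH"

# index with a literal slice
print(rec[6:20])          # IMPORTANT DATA

# build the same slice as an object and reuse it
data_slice = slice(6, 20)
print(rec[data_slice])    # IMPORTANT DATA

# a slice object simply carries its three values
print(data_slice.start, data_slice.stop, data_slice.step)  # 6 20 None
```

Because the slice is an ordinary object, it can be stored in a list next to a field name, which is what the extractor relies on.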

Here is an extractor for your sample data record. It builds the slices and their associated names by parsing the field specification:

layout = """\
 - { name: 'PCBCode', type: String, length: 8 }
 - { name: 'NumberTG', type: String, length: 17 }
 - { name: 'TopSort', type: String, length: 1 }
 - { name: 'InternalRef', type: String, length: 5 }
 - { name: 'DateInt', type: String, length: 26 }
 - { name: 'DAT', type: String, length: 26 }
 - { name: 'DIN', type: String, length: 26 }
 - { name: 'DLN', type: String, length: 26 }
 - { name: 'DON', type: String, length: 26 }
 - { name: 'Sort', type: String, length: 10 }
 - { name: 'NameCharter', type: String, length: 35 }
 - { name: 'NumberReg', type: String, length: 17 }
 - { name: 'NatTruck', type: String, length: 3 }
 - { name: 'NumRemarks', type: String, length: 17 }
 - { name: 'NatRemarks', type: String, length: 3 }
 - { name: 'Weight', type: String, length: 6 }
 - { name: 'Remarks', type: String, length: 35 }
 """

# build data slicer - list of names and slices for each field in the fixed format input
slicer = []
cur = 0
for line in layout.splitlines():
    # split the line on whitespace, will give a list like:
    #   ['-', '{', 'name:', "'PCBCode',", 'type:', 'String,', 'length:', '8', '}']
    # the name is in element 3 (we start with 0), and the integer length 
    # is second from last, so we can use index -2 to get it
    parts = line.split()
    if not parts:
        continue
    slice_name = parts[3].strip("',")
    slice_len = int(parts[-2])
    slicer.append((slice_name, slice(cur, cur+slice_len)))
    cur += slice_len

# print out the names and slices
for slc in slicer:
    print(slc)
print()
Prints:

('PCBCode', slice(0, 8, None))
('NumberTG', slice(8, 25, None))
('TopSort', slice(25, 26, None))
('InternalRef', slice(26, 31, None))
('DateInt', slice(31, 57, None))
('DAT', slice(57, 83, None))
('DIN', slice(83, 109, None))
('DLN', slice(109, 135, None))
('DON', slice(135, 161, None))
('Sort', slice(161, 171, None))
('NameCharter', slice(171, 206, None))
('NumberReg', slice(206, 223, None))
('NatTruck', slice(223, 226, None))
('NumRemarks', slice(226, 243, None))
('NatRemarks', slice(243, 246, None))
('Weight', slice(246, 252, None))
('Remarks', slice(252, 287, None))
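Splitting each layout line on whitespace depends on the exact spacing of the schema lines. If your spacing varies, a regular expression is a little more forgiving. The following is only a sketch under that assumption; FIELD_RE and build_slicer are names made up for this example, and it expects each field to sit on one line in the `name: '...'` ... `length: N` form shown in the schema.

```python
import re

# capture the quoted field name and the integer length from one schema line
FIELD_RE = re.compile(r"name:\s*'([^']+)'.*?length:\s*(\d+)")


def build_slicer(layout_text):
    """Return a list of (name, slice) pairs, in the order the fields appear."""
    slicer = []
    cur = 0
    for name, length in FIELD_RE.findall(layout_text):
        width = int(length)
        slicer.append((name, slice(cur, cur + width)))
        cur += width
    return slicer


sample_layout = """\
 - { name: 'PCBCode', type: String, length: 8 }
 - {name: 'NumberTG',   type: String, length: 17}
"""
print(build_slicer(sample_layout))
# [('PCBCode', slice(0, 8, None)), ('NumberTG', slice(8, 25, None))]
```

The result has the same shape as the slicer list built by the loop, so either one can be passed to the same extraction code.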
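Now you can use the slices (which are like little (start, end, step) triples, just like the values you would use to index into a string with data[start:end:step]) together with their associated names to build a dict: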

# a simple method to slice up a fixed format data line with a slicer, strips trailing spaces from fields
def extract(slicer, data_line):
    return {name: data_line[data_slice].strip() for name, data_slice in slicer}
Here is how it looks with your data:

# try it out
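# the record must sit on one line inside the quotes; the trailing padding of the
# 4000-character line is trimmed here, since only the first 287 characters are read
data = "BCD_VAN 180223G04467     N377612018-02-23-13.57.15.7722282018-02-26-13.21.26.3305841901-01-01-00.00.00.0000001901-01-01-00.00.00.0000001901-01-01-00.00.00.000000          TAURUS                             W1TRS19          PL WWL72142         PL 000000                                   G18GKJ99-690851                     G18GKJ96-690851                     G18GKJ22-685131                     G18GKJ00-668701                     G18GGX99-668701"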
data_dict = extract(slicer, data)

# output as JSON
import json
print(json.dumps(data_dict, indent=2))
Prints:

{
  "Remarks": "",
  "PCBCode": "BCD_VAN",
  "NatTruck": "PL",
  "DateInt": "2018-02-23-13.57.15.772228",
  "DAT": "2018-02-26-13.21.26.330584",
  "NumRemarks": "WWL72142",
  "Weight": "000000",
  "DIN": "1901-01-01-00.00.00.000000",
  "Sort": "",
  "InternalRef": "37761",
  "NumberTG": "180223G04467",
  "DLN": "1901-01-01-00.00.00.000000",
  "TopSort": "N",
  "NatRemarks": "PL",
  "NumberReg": "W1TRS19",
  "NameCharter": "TAURUS",
  "DON": "1901-01-01-00.00.00.000000"
}
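The same idea can be carried over to the repeating Items records that follow the header. This is a rough sketch, not a fix for the Anypoint Studio error. It assumes the header fills the first 287 characters (the sum of the Header field lengths), that each Items record is 17 + 1 + 1 + 17 = 36 characters wide as in your schema, and that the items are contiguous, so the first blank slot marks the end. HEADER_LEN, ITEM_FIELDS, make_slicer and extract_items are names invented for this example.

```python
# Assumptions taken from the schema in the question: header = 287 characters,
# each Items record = TVNum(17) + Load(1) + Flag(1) + col(17) = 36 characters.
HEADER_LEN = 287
ITEM_FIELDS = [("TVNum", 17), ("Load", 1), ("Flag", 1), ("col", 17)]
ITEM_LEN = sum(length for _, length in ITEM_FIELDS)


def make_slicer(fields):
    """Turn (name, length) pairs into (name, slice) pairs, left to right."""
    slicer = []
    cur = 0
    for name, length in fields:
        slicer.append((name, slice(cur, cur + length)))
        cur += length
    return slicer


def extract_items(data_line, max_items=99):
    """Return one dict per non-blank Items slot that follows the header."""
    item_slicer = make_slicer(ITEM_FIELDS)
    items = []
    for i in range(max_items):
        start = HEADER_LEN + i * ITEM_LEN
        chunk = data_line[start:start + ITEM_LEN]
        if not chunk.strip():
            break  # assumption: items are contiguous, so a blank slot ends the list
        items.append({name: chunk[s].strip() for name, s in item_slicer})
    return items


# hypothetical record: a blank header followed by two items, padded to 4000
line = (" " * HEADER_LEN
        + "G18GKJ99-690851".ljust(ITEM_LEN)
        + "G18GKJ96-690851".ljust(ITEM_LEN)).ljust(4000)
for item in extract_items(line):
    print(item)
```

Stopping at the first blank slot keeps the result short when only a few of the 99 possible items are filled. If blank slots can appear between items in your data, replace the break with continue.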

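Thanks, Paul, for sharing the detailed solution. I have not used Python before, so it will take me some time to understand how it works. To explain further, though: the problem I am facing concerns the Items iteration (which can repeat up to 99 times). When I validate the input by separating the header and the items, I can parse it easily, but when I try to parse the whole input together, it stops working and throws the "no segments defined" error. I am trying to avoid the whole splitting scenario. Any thoughts on this are much appreciated.

"No segments defined" is not a Python error but an Anypoint Studio error, so you should use that product's resources to find out what the error means. I will add more explanation about slices to my answer, but I was not posting a complete solution to your problem, just an example of how to crack a fixed-width format.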