How to print a JSON object in AWK


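I was looking for built-in awk functions that make it easy to generate JSON objects. I found several answers and decided to write my own.

I want to generate JSON from a multidimensional array that stores table-style data, using a separate, dynamically defined JSON schema to drive the output.

Expected output: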

{
"Name": JanA
"Surname": NowakA
"ID": 1234A
"Role": PrezesA
}
{
"Name": JanD
"Surname": NowakD
"ID": 12341D
"Role": PrezesD
}
{
"Name": JanC
"Surname": NowakC
"ID": 12342C
"Role": PrezesC
}
Input file:

pierwsza linia
druga linia
trzecia linia

dane wspólników
imie JanA
nazwisko NowakA
pesel 11111111111A
funkcja PrezesA

imie Ja"nD
nazwisko NowakD
pesel 11111111111
funkcja PrezesD

imie JanC
nazwisko NowakC
pesel 12342C
funkcja PrezesC

czwarta linia

reprezentanci

imie Tomek
Based on the input file, I built a multidimensional array:

JanA  NowakA 1234A PrezesA
JanD  NowakD 12341D PrezesD
JanC  NowakC 12342C PrezesC

My updated awk implementation of a simple array printer, with regex-based validation for each column (run with gawk):


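I'll take a stab at a clumsy solution. The indentation isn't perfect and the results aren't sorted (see the "Sorting" note below), but it can at least walk a true multidimensional array recursively, and it should produce valid, parseable JSON from any array. Bonus: the data array is the schema. Array keys become JSON keys, so there is no need to create a separate schema array in addition to the data array.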
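As a rough illustration of the recursive traversal described above, here is a minimal sketch, not the serializer itself. It assumes gawk 4.0 or later (true arrays of arrays, isarray() and PROCINFO["sorted_in"]). The shell wrapper walk_demo, the awk function walk and the sample values are made up for this example.

```shell
# Hypothetical demo; requires gawk 4.0+ for arrays of arrays and isarray().
walk_demo() {
    gawk '
    # Visit every element; recurse into subarrays, print leaves as path = value.
    function walk(arr, path,    k) {
        for (k in arr) {
            if (isarray(arr[k]))
                walk(arr[k], path "[" k "]")
            else
                print path "[" k "] = " arr[k]
        }
    }
    BEGIN {
        PROCINFO["sorted_in"] = "@ind_str_asc"   # make the traversal order deterministic
        record[1]["name"] = "JanA"
        record[1]["role"] = "PrezesA"
        record[2]["name"] = "JanD"
        walk(record, "record")
    }'
}

# Run the demo only where gawk is installed.
if command -v gawk >/dev/null 2>&1; then walk_demo; fi
```

Each leaf comes out as a line such as record[1][name] = JanA. A serializer follows the same walk but emits JSON punctuation instead of a path.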

Just make sure you build the data array with the
array[d1][d2][d3]...
convention rather than the
array[d1,d2,d3...]
convention.
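To see why the comma convention gets in the way of a recursive walk, here is a small sketch in portable awk. With array[d1,d2] the subscripts are joined into a single string key using SUBSEP, so there is no subarray to descend into. The wrapper name flat_keys and the sample value are made up for this example.

```shell
# Portable awk: rec[1, "name"] is ONE key, "1" SUBSEP "name", not a nested array.
flat_keys() {
    awk 'BEGIN {
        rec[1, "name"] = "JanA"
        for (k in rec) {
            n = split(k, parts, SUBSEP)   # recover the original subscripts
            printf "%d subscripts: %s and %s -> %s\n", n, parts[1], parts[2], rec[k]
        }
    }'
}
flat_keys
```

This prints "2 subscripts: 1 and name -> JanA": the structure has to be rebuilt by splitting keys, whereas array[d1][d2] keeps it as real nested arrays that isarray() can detect.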

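Update: I have an updated JSON gawk script, posted as a gist. Although the script below was tested and works with the OP's data, I may have made improvements since this post was last edited. See the gist for the most thoroughly tested, bug-squashed version.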

#!/usr/bin/gawk -f

BEGIN { IGNORECASE = 1 }

$1 ~ "imie" { record[++idx]["name"] = $2 }
$1 ~ "nazwisko" { record[idx]["surname"] = $2 }
$1 ~ "pesel" { record[idx]["ID"] = $2 }
$1 ~ "funkcja" { record[idx]["role"] = $2 }

END { print serialize(record, "\t") }

# ==== FUNCTIONS ====

function join(arr, sep, _p, i) {
    # syntax: join(array, string separator)
    # returns a string

    for (i in arr) {
        _p["result"] = _p["result"] ~ "[[:print:]]" ? _p["result"] sep arr[i] : arr[i]
    }
    return _p["result"]
}

function quote(str) {
    # escape backslashes first, then double quotes and control characters
    gsub(/\\/, "\\\\", str)
    gsub(/"/, "\\\"", str)
    gsub(/\r/, "\\r", str)
    gsub(/\n/, "\\n", str)
    gsub(/\t/, "\\t", str)
    return "\"" str "\""
}

function serialize(arr, indent_with, depth, _p, i, idx) {
    # syntax: serialize(array of arrays, indent string)
    # returns a JSON formatted string

    # sort arrays on key, ensures [...] values remain properly ordered
    if (!PROCINFO["sorted_in"]) PROCINFO["sorted_in"] = "@ind_num_asc"

    # determine whether array is indexed or associative
    for (i in arr) {
        _p["assoc"] = or(_p["assoc"], !(++_p["idx"] in arr))
    }

    # if associative, indent
    if (_p["assoc"]) {
        for (i = ++depth; i--;) {
            _p["end"] = _p["indent"]; _p["indent"] = _p["indent"] indent_with
        }
    }

    for (i in arr) {
        # If key length is 0, assume it is an empty object
        if (!length(i)) return "{}"

        # quote key if not already quoted
        _p["key"] = i !~ /^".*"$/ ? quote(i) : i

        if (isarray(arr[i])) {
            if (_p["assoc"]) {
                _p["json"][++idx] = _p["indent"] _p["key"] ": " \
                    serialize(arr[i], indent_with, depth)
            } else {
                # if indexed array, do not print keys
                _p["json"][++idx] = serialize(arr[i], indent_with, depth)
            }
        } else {
            # quote if not numeric, boolean, null, already quoted, or too big for match()
            if (!((arr[i] ~ /^[0-9]+([\.e][0-9]+)?$/ && arr[i] !~ /^0[0-9]/) ||
                arr[i] ~ /^(true|false|null|".*")$/) || length(arr[i]) > 1000)
                arr[i] = quote(arr[i])

            _p["json"][++idx] = _p["assoc"] ? _p["indent"] _p["key"] ": " arr[i] : arr[i]
        }
    }

    # I trial and errored the hell out of this. Problem is, gawk cannot distinguish between
    # a value of null and no value.  I think this hack is as close as I can get, although
    # [""] will become [].
    if (!_p["assoc"] && join(_p["json"]) == "\"\"") return "[]"

    # surround with curly braces if object, square brackets if array
    return _p["assoc"] ? "{\n" join(_p["json"], ",\n") "\n" _p["end"] "}" \
        : "[" join(_p["json"], ", ") "]"
}
Output produced from the OP's sample data:

[{
    "ID": "1234A",
    "name": "JanA",
    "role": "PrezesA",
    "surname": "NowakA"
}, {
    "ID": "12341D",
    "name": "JanD",
    "role": "PrezesD",
    "surname": "NowakD"
}, {
    "ID": "12342C",
    "name": "JanC",
    "role": "PrezesC",
    "surname": "NowakC"
}, {
    "name": "Tomek"
}]

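Sorting

By default the results come out in an order only gawk understands, but gawk can also sort the results on a field. For example, to sort on the ID field, add the following function: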

function cmp_ID(i1, v1, i2, v2) {
    if (!isarray(v1) && v1 ~ /"ID"/ ) {
        return v1 < v2 ? -1 : (v1 != v2)
    }
}

For more information, see the gawk manual.
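As a sketch of how a custom sort order plugs into array traversal, the example below orders records by their ID field. It is a separate illustration, not the cmp_ID function above. It assumes gawk 4.0 or later, and the names sort_demo, by_id and rec are invented for the example.

```shell
# Hypothetical demo; requires gawk 4.0+ for PROCINFO["sorted_in"] and subarrays.
sort_demo() {
    gawk '
    # Comparison function: v1 and v2 are the record subarrays being ordered.
    function by_id(i1, v1, i2, v2,    a, b) {
        a = v1["ID"] ""     # force string comparison
        b = v2["ID"] ""
        return a < b ? -1 : (a != b)
    }
    BEGIN {
        rec[1]["ID"] = "12342C"; rec[1]["name"] = "JanC"
        rec[2]["ID"] = "1234A";  rec[2]["name"] = "JanA"
        rec[3]["ID"] = "12341D"; rec[3]["name"] = "JanD"
        PROCINFO["sorted_in"] = "by_id"   # name of the comparison function
        for (i in rec) print rec[i]["ID"], rec[i]["name"]
    }'
}

# Run the demo only where gawk is installed.
if command -v gawk >/dev/null 2>&1; then sort_demo; fi
```

The loop visits the records in string order of ID (12341D, 12342C, 1234A), because gawk consults the function named in PROCINFO["sorted_in"] whenever it runs a for (i in array) loop.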

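You can simplify string trimming with function trim(s) { gsub(/^\s+|\s+$/, "", s); return s }

@djsowa, the awk solution added here may also help you. Please check it and let me know.

Your desired output is not valid JSON. Try pasting it into a JSON validator.

Thanks @rojo, your example contains a lot of fundamental gawk knowledge, and I will adapt it to my case. I only chose my own implementation because I need to separate the data from the schema.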
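The trim() helper suggested in the comments relies on \s, which is a gawk extension. A portable variant, sketched here under the assumption that only spaces and tabs need stripping, uses a bracket expression instead. The wrapper name trim_demo and the sample input are made up.

```shell
# Portable awk sketch of the trim() idea from the comments.
trim_demo() {
    printf '  JanA \t\n' | awk '
    # Strip leading and trailing blanks and tabs; [ \t] avoids the gawk-only \s.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    { print "[" trim($0) "]" }'
}
trim_demo
```

The brackets in the output ([JanA]) make it easy to confirm that no whitespace is left on either side.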
To use cmp_ID, set it as the traversal order in the BEGIN block:

PROCINFO["sorted_in"] = "cmp_ID"