Regex to parse an interesting CSV?

regex csv awk

I need to parse a CSV file using AWK. A line in the CSV could look like this:

"hello, world?",1 thousand,"oneword",,,"last one"

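Some important observations:

- Fields inside quoted strings can contain commas and multiple words.
- Unquoted fields can be multiple words.
- Fields can be empty, which shows up as two commas in a row.

Any clues on writing a regex that splits this line properly?

Thanks!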

Try this:

^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$

I haven't tested it with AWK, though.

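As many have observed, CSV is a harder format than it first appears. There are many edge cases and ambiguities. For example, in your sample, is ',,,' a field containing a comma or two blank fields?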
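To make the difficulty concrete, here is a small sketch showing what happens when AWK simply splits on every comma with -F,. The variable names line and naive_count are mine, not from the answer. The comma inside the first quoted field is treated as a separator, so AWK reports seven fields where the question intends six.

```shell
# Naive approach: treat every comma as a field separator.
# "line" and "naive_count" are illustrative names only.
line='"hello, world?",1 thousand,"oneword",,,"last one"'

# NF is the number of fields AWK found on the line.
naive_count=$(printf '%s\n' "$line" | awk -F, '{ print NF }')

echo "$naive_count"
```

The quoted "hello, world?" is cut in two, which is exactly the case a CSV-aware parser has to handle.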

Perl, Python, Java, etc. are better suited to handling CSV because they all have well-tested libraries. A regex will be more fragile.

With AWK, I have had some success with this AWK function. It works under AWK, gawk, and nawk:

#!/usr/bin/awk -f
#**************************************************************************
#
# This file is in the public domain.
#
# For more information email LoranceStinson+csv@gmail.com.
# Or see http://lorance.freeshell.org/csv/
#
# Parse a CSV string into an array.
# The number of fields found is returned.
# In the event of an error a negative value is returned and csverr is set to
# the error. See below for the error values.
#
# Parameters:
# string  = The string to parse.
# csv     = The array to parse the fields into.
# sep     = The field separator character. Normally ,
# quote   = The string quote character. Normally "
# escape  = The quote escape character. Normally "
# newline = Handle embedded newlines. Provide either a newline or the
#           string to use in place of a newline. If left empty embedded
#           newlines cause an error.
# trim    = When true spaces around the separator are removed.
#           This affects parsing. Without this a space between the
#           separator and quote result in the quote being ignored.
#
# These variables are private:
# fields  = The number of fields found thus far.
# pos     = Where to pull a field from the string.
# strtrim = True when a string is found so we know to remove the quotes.
#
# Error conditions:
# -1  = Unable to read the next line.
# -2  = Missing end quote.
# -3  = Missing separator.
#
# Notes:
# The code assumes that every field is preceded by a separator, even the
# first field. This makes the logic much simpler, but also requires a
# separator be prepended to the string before parsing.
#**************************************************************************
function parse_csv(string,csv,sep,quote,escape,newline,trim, fields,pos,strtrim) {
    # Make sure there is something to parse.
    if (length(string) == 0) return 0;
    string = sep string; # The code below assumes ,FIELD.
    fields = 0; # The number of fields found thus far.
    while (length(string) > 0) {
        # Remove spaces after the separator if requested.
        if (trim && substr(string, 2, 1) == " ") {
            if (length(string) == 1) return fields;
            string = substr(string, 2);
            continue;
        }
        strtrim = 0; # Used to trim quotes off strings.
        # Handle a quoted field.
        if (substr(string, 2, 1) == quote) {
            pos = 2;
            do {
                pos++
                if (pos != length(string) &&
                    substr(string, pos, 1) == escape &&
                    (substr(string, pos + 1, 1) == quote ||
                     substr(string, pos + 1, 1) == escape)) {
                    # Remove escaped quote characters.
                    string = substr(string, 1, pos - 1) substr(string, pos + 1);
                } else if (substr(string, pos, 1) == quote) {
                    # Found the end of the string.
                    strtrim = 1;
                } else if (newline && pos >= length(string)) {
                    # Handle embedded newlines if requested.
                    if (getline == -1) {
                        csverr = "Unable to read the next line.";
                        return -1;
                    }
                    string = string newline $0;
                }
            } while (pos < length(string) && strtrim == 0)
            if (strtrim == 0) {
                csverr = "Missing end quote.";
                return -2;
            }
        } else {
            # Handle an empty field.
            if (length(string) == 1 || substr(string, 2, 1) == sep) {
                csv[fields] = "";
                fields++;
                if (length(string) == 1)
                    return fields;
                string = substr(string, 2);
                continue;
            }
            # Search for a separator.
            pos = index(substr(string, 2), sep);
            # If there is no separator the rest of the string is a field.
            if (pos == 0) {
                csv[fields] = substr(string, 2);
                fields++;
                return fields;
            }
        }
        # Remove spaces after the separator if requested.
        if (trim && pos != length(string) && substr(string, pos + strtrim, 1) == " ") {
            trim = strtrim
            # Count the number of spaces found.
            while (pos < length(string) && substr(string, pos + trim, 1) == " ") {
                trim++
            }
            # Remove them from the string.
            string = substr(string, 1, pos + strtrim - 1) substr(string,  pos + trim);
            # Adjust pos with the trimmed spaces if a quoted string was not found.
            if (!strtrim) {
                pos -= trim;
            }
        }
        # Make sure we are at the end of the string or there is a separator.
        if ((pos != length(string) && substr(string, pos + 1, 1) != sep)) {
            csverr = "Missing separator.";
            return -3;
        }
        # Gather the field.
        csv[fields] = substr(string, 2 + strtrim, pos - (1 + strtrim * 2));
        fields++;
        # Remove the field from the string for the next pass.
        string = substr(string, pos + 1);
    }
    return fields;
}

{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    if (num_fields < 0) {
        printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
    } else {
        printf "%s -> \n", $0;
        printf "%s fields\n", num_fields;
        for (i = 0;i < num_fields;i++) {
            printf "%s\n", csv[i];
        }
        printf "|\n";
    }
}

A sample Perl solution:

$ echo '"hello, world?",1 thousand,"oneword",,,"last one"' | 
perl -lnE 'for(/(?:^|,)("(?:[^"]+|"")*"|[^,]*)/g) { s/"$//; s/""/"/g if (s/^"//);
say}'

Do you really need to use awk? There are many languages with built-in CSV parsers.

(?:^|,)("(?:[^"]++|"")*"|[^,]*)

AWK doesn't do non-capturing subpatterns.
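Since AWK has no non-capturing groups and no global match that returns captures, one possible workaround is to peel fields off the front of the line one at a time. The sketch below is only that, a sketch: split_csv is a hypothetical helper name, and it assumes well-formed input, no embedded newlines, and "" as the escape for a quote inside a quoted field. It uses only match(), index(), substr() and gsub(), so it should not need gawk extensions.

```shell
# split_csv: hypothetical helper, prints one field per output line.
# Assumes well-formed CSV, no embedded newlines, "" as the escaped quote.
split_csv() {
    awk '{
        rest = $0
        while (1) {
            if (match(rest, /^"([^"]|"")*"/)) {
                # Quoted field: strip the outer quotes, unescape doubled quotes.
                len = RLENGTH
                field = substr(rest, 2, len - 2)
                gsub(/""/, "\"", field)
            } else {
                # Unquoted (possibly empty) field: runs up to the next comma.
                n = index(rest, ",")
                len = (n ? n - 1 : length(rest))
                field = substr(rest, 1, len)
            }
            print field
            rest = substr(rest, len + 1)   # drop the field just consumed
            if (rest == "") break          # nothing left: done with this line
            rest = substr(rest, 2)         # skip the separating comma
        }
    }'
}

printf '%s\n' '"hello, world?",1 thousand,"oneword",,,"last one"' | split_csv
```

For the sample line this prints six lines: hello, world? then 1 thousand, then oneword, then two empty lines, then last one. Malformed input such as text directly after a closing quote is not detected here; the parse_csv function earlier in the thread reports that case as an error.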
