Linux: Ignoring commas inside a CSV field with awk
I am trying to get a number from the second field of the last line of a CSV file. So far, I have:
awk -F"," 'END {print $2}' /file/path/fileName.csv
This works, unless the first field of the last line contains a comma. For a line like this:
"Company Name, LLC", 12345, Type1, SubType3
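...where "Company Name, LLC" is actually the first field, the awk command returns LLC.
How can I ignore the comma inside the first field so that I get the information in the second field?
I think your requirement is the perfect use case for FPAT in GNU Awk.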
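Quoting the manual:
Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.
The most notorious such case is so-called comma-separated values (CSV) data. If commas only separated the data, there wouldn't be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.
In the case of CSV data as presented here, each field is either "anything that is not a comma," or "a double quote, anything that is not a double quote, and a closing double quote." If written as a regular expression constant (see Regexp), we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to: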
FPAT = "([^,]+)|(\"[^\"]+\")"
Using it on your input file:
awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1}' file
"Company Name, LLC"
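To see what that pattern selects for the second field, here is a rough C sketch that applies the same extended regular expression with POSIX regcomp/regexec. The function name fpat_field is hypothetical and this is not how gawk is implemented; it only assumes that regexec returns the leftmost-longest match, as POSIX requires.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the n-th (1-based) match of the FPAT-style
 * pattern found in rec into out. Returns 0 on success, -1 if there is
 * no such match or out is too small. */
int fpat_field(const char *rec, int n, char *out, size_t cap) {
    regex_t re;
    regmatch_t m;
    int rc = -1;

    /* Same expression as the FPAT string above, as a POSIX ERE. */
    if (regcomp(&re, "([^,]+)|(\"[^\"]+\")", REG_EXTENDED) != 0)
        return -1;

    const char *p = rec;
    for (int i = 1; regexec(&re, p, 1, &m, 0) == 0; i++) {
        if (i == n) {
            size_t len = (size_t)(m.rm_eo - m.rm_so);
            if (len < cap) {
                memcpy(out, p + m.rm_so, len);
                out[len] = '\0';
                rc = 0;
            }
            break;
        }
        p += m.rm_eo;   /* resume the search after this match */
    }
    regfree(&re);
    return rc;
}
```

For the sample line in the question, asking for match 1 gives the company name with its surrounding quotes, and match 2 gives " 12345" with the leading blank, because [^,]+ also matches spaces. The value may therefore need trimming before it is used as a number.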
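There is no general answer to this question, because regular expressions are not powerful enough to parse CSV (in the general case). My solution is a C program that uses a finite-state machine to preprocess the input; its output can then be fed to Awk: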
/* NAME
*
* csv -- convert comma-separated values file to character-delimited
*
*
* SYNOPSIS
*
* csv [-Cc] [-Fc] [filename ...]
*
*
* DESCRIPTION
*
* Csv reads from standard input or from one or more files named on
* the command line a sequence of records in comma-separated values
* format and writes on standard output the same records in character-
* delimited format. Csv returns 0 on success, 1 for option errors,
* and 2 if any file couldn't be opened.
*
* The comma-separated values format has developed over time as a
* set of conventions that has never been formally defined, and some
* implementations are in conflict about some of the details. In
* general, the comma-separated values format is used by databases,
* spreadsheets, and other programs that need to write data consisting
* of records containing fields. The data is written as ascii text,
* with records terminated by newlines and fields containing zero or
* more characters separated by commas. Leading and trailing space in
* unquoted fields is preserved. Fields may be surrounded by double-
* quote characters (ascii \042); such fields may contain newlines,
* literal commas (ascii \054), and double-quote characters
* represented as two successive double-quotes. The examples shown
* below clarify many irregular situations that may arise.
*
* The field separator is normally a comma, but can be changed to an
* arbitrary character c with the command line option -Cc. This is
* useful in those European countries that use a comma instead of a
* decimal point, where the field separator is normally changed to a
* semicolon.
*
* Character-delimited format has records terminated by newlines and
* fields separated by a single character, which is \034 by default
* but may be changed with the -Fc option on the command line.
*
*
* EXAMPLE
*
* Each record below has five fields. For readability, the three-
* character sequence TAB represents a single tab character (ascii
* \011).
*
* $ cat testdata.csv
* 1,abc,def ghi,jkl,unquoted character strings
* 2,"abc","def ghi","jkl",quoted character strings
* 3,123,456,789,numbers
* 4, abc,def , ghi ,strings with whitespace
* 5, "abc","def" , "ghi" ,quoted strings with whitespace
* 6, 123,456 , 789 ,numbers with whitespace
* 7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
* 8, -123, +456, 1E3,more numbers with whitespace
* 9,123 456,123"456, 123 456 ,strange numbers
* 10,abc",de"f,g"hi,embedded quotes
* 11,"abc""","de""f","g""hi",quoted embedded quotes
* 12,"","" "",""x"",doubled quotes
* 13,"abc"def,abc"def","abc" "def",strange quotes
* 14,,"", ,empty fields
* 15,abc,"def
* ghi",jkl,embedded newline
* 16,abc,"def",789,multiple types of fields
*
* $ csv -F'|' testdata.csv
* 1|abc|def ghi|jkl|unquoted character strings
* 2|abc|def ghi|jkl|quoted character strings
* 3|123|456|789|numbers
* 4| abc|def | ghi |strings with whitespace
* 5| "abc"|def | "ghi" |quoted strings with whitespace
* 6| 123|456 | 789 |numbers with whitespace
* 7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
* 8| -123| +456| 1E3|more numbers with whitespace
* 9|123 456|123"456| 123 456 |strange numbers
* 10|abc"|de"f|g"hi|embedded quotes
* 11|abc"|de"f|g"hi|quoted embedded quotes
* 12|| ""|x""|doubled quotes
* 13|abcdef|abc"def"|abc "def"|strange quotes
* 14||| |empty fields
* 15|abc|def
* ghi|jkl|embedded newline
* 16|abc|def|789|multiple types of fields
*
* It is particularly easy to pipe the output from csv into any of
* the unix tools that accept character-delimited fielded text data
* files, such as sort, join, or cut. For example:
*
* csv datafile.csv | awk -F'\034' -f program.awk
*
*
* BUGS
*
* On DOS, Windows, and OS/2 systems, processing of each file stops
* at the first appearance of the ascii \032 (control-Z) end of file
* character.
*
* Because newlines embedded in quoted fields are treated literally,
* a missing closing quote can suck up all remaining input.
*
*
* LICENSE
*
* This program was written by Philip L. Bewig of Saint Louis,
* Missouri, United States of America on February 28, 2002 and
* placed in the public domain.
*/
#include <stdio.h>
#include <stdlib.h> /* malloc, exit */
#include <string.h> /* strlen, strcpy */
/* dofile -- convert one file from comma-separated to delimited */
void dofile(char ofs, char fs, FILE *f) {
    int c; /* current input character */

START:
    c = fgetc(f);
    if (c == EOF)  { return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == '\"') { goto QUOTED_FIELD; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

NOT_FIELD:
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == '\"') { goto QUOTED_FIELD; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

QUOTED_FIELD:
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\"') { goto MAY_BE_DOUBLED_QUOTES; }
    /* default */  { putchar(c); goto QUOTED_FIELD; }

MAY_BE_DOUBLED_QUOTES:
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == '\"') { putchar('\"'); goto QUOTED_FIELD; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

UNQUOTED_FIELD:
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

CARRIAGE_RETURN:
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { putchar('\n'); goto CARRIAGE_RETURN; }
    if (c == '\n') { putchar('\n'); goto START; }
    if (c == '\"') { putchar('\n'); goto QUOTED_FIELD; }
    if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
    /* default */  { printf("\n%c",c); goto UNQUOTED_FIELD; }

LINE_FEED:
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { putchar('\n'); goto START; }
    if (c == '\n') { putchar('\n'); goto LINE_FEED; }
    if (c == '\"') { putchar('\n'); goto QUOTED_FIELD; }
    if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
    /* default */  { printf("\n%c",c); goto UNQUOTED_FIELD; }
}
/* main -- process command line, call appropriate conversion */
int main(int argc, char *argv[]) {
    char ofs = '\034'; /* output field separator */
    char fs = ',';     /* input field separator */
    int status = 0;    /* error status for return to operating system */
    char *progname;    /* name of program for error messages */
    FILE *f;
    int i;

    progname = (char *) malloc(strlen(argv[0])+1);
    strcpy(progname, argv[0]);

    while (argc > 1 && argv[1][0] == '-') {
        switch (argv[1][1]) {
            case 'c':
            case 'C':
                fs = argv[1][2];
                break;
            case 'f':
            case 'F':
                ofs = argv[1][2];
                break;
            default:
                fprintf(stderr, "%s: unknown argument %s\n",
                    progname, argv[1]);
                fprintf(stderr,
                    "usage: %s [-Cc] [-Fc] [filename ...]\n",
                    progname);
                exit(1);
        }
        argc--;
        argv++;
    }

    if (argc == 1)
        dofile(ofs, fs, stdin);
    else
        for (i = 1; i < argc; i++)
            if ((f = fopen(argv[i], "r")) == NULL) {
                fprintf(stderr, "%s: can't open %s\n",
                    progname, argv[i]);
                status = 2;
            } else {
                dofile(ofs, fs, f);
                fclose(f);
            }

    exit(status);
}
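If all that is needed is one field from one record, as in the original question, the same quoting rules can be applied in a much smaller routine. The following is only a sketch under that assumption: csv_nth_field is a hypothetical helper that is not part of the program above, and it works on a record already held in memory rather than on a stream.

```c
#include <stddef.h>

/* Hypothetical helper: copy the n-th (1-based) field of one CSV record
 * into out, using fs as the field separator. A quote at the start of a
 * field opens a quoted section in which fs and newlines are literal and
 * a doubled quote stands for one quote; quotes elsewhere are literal.
 * Returns 0 on success, -1 if the field is missing or out is too small. */
int csv_nth_field(const char *rec, char fs, int n, char *out, size_t cap) {
    int field = 1;     /* index of the field being scanned */
    int quoted = 0;    /* nonzero while inside a quoted section */
    int at_start = 1;  /* nonzero until a field's first character is seen */
    size_t len = 0;    /* characters copied into out so far */

    if (cap == 0 || n < 1)
        return -1;

    for (const char *p = rec; ; p++) {
        char c = *p;
        int end = (c == '\0') || (!quoted && (c == '\n' || c == '\r'));

        if (end || (!quoted && c == fs)) {
            if (field == n) { out[len] = '\0'; return 0; }
            if (end) return -1;            /* record ended before field n */
            field++;
            at_start = 1;
            continue;
        }
        if (c == '"' && quoted) {
            if (p[1] == '"') p++;          /* doubled quote: emit one below */
            else { quoted = 0; continue; } /* closing quote */
        } else if (c == '"' && at_start) {
            quoted = 1;                    /* opening quote */
            at_start = 0;
            continue;
        }
        at_start = 0;
        if (field == n) {
            if (len + 1 >= cap) return -1; /* keep room for the terminator */
            out[len++] = c;
        }
    }
}
```

Called with the sample line from the question, fs set to ',' and n set to 2, this fills out with " 12345". The leading blank is kept because whitespace in unquoted fields is preserved, matching the conventions described in the program's header comment.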