
sed: removing quotes within quotes in a large CSV file


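I am using the stream editor sed to convert a large set of text file data (400MB) into CSV format.

I am close to finishing, but the outstanding problem is quotes within quotes, in data like this: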

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"
The desired output is:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
I have looked around for help but have not come close to a solution. I tried the following sed commands with regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt
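sed -i 's/(?
This checks for strings of the form "STR1"STR2"STR3" and converts them to "STR1 STR2 STR3". Whenever such a string is found, the substitution is repeated, so that all nested strings are eliminated even at depth > 2.

It also ensures that none of the STRx contains a comma.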
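The repeat-until-nothing-changes idea described above can be sketched with sed's label and branch commands. This is not the answer's original command, which is not preserved here, but a minimal illustration under stated assumptions: an embedded quote is never directly adjacent to a comma, and a record never spans more than one line. Because sed's regex dialect has no lookaround, capture groups stand in for the lookbehind and lookahead from the question. The function name `strip_inner_quotes` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): filters CSV lines from stdin.
strip_inner_quotes() {
    # Delete every double quote that has a non-comma character on both
    # sides, keeping those neighbours through the two capture groups.
    # Quotes next to a comma or at the start/end of a line are treated as
    # field delimiters and left alone (this is the stated assumption).
    # The ':a' ... 'ta' loop re-runs the substitution until it stops
    # matching, because a single /g pass cannot reuse a character that a
    # previous match on the same line already consumed.
    sed -E -e ':a' -e 's/([^,])"([^,])/\1\2/g' -e 'ta'
}

# Demo on the sample rows from the question.
strip_inner_quotes <<'EOF'
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"
EOF
```

To run it over real files, the same three `-e` expressions can be passed to `sed -E` directly. Trying it on a copy first is prudent, since data that breaks the adjacency assumption would lose or keep the wrong quotes.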

Here's one way using GNU awk and its FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file
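Explanation:

Using FPAT, a field is defined as either "anything that is not a comma", or "a double quote, anything that is not a double quote, and a closing double quote". Then, on each line of input, loop through each field, and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add the double quotes back around the field.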


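Thanks, this is nearly there. I get
1,word1,"description for word1","another text","text contains double quotes" some more text"
on the first line, though. Would you mind explaining what \1\2\3 does?

@alinsoar, thank you both. In the end, steve's answer helped me finish it with better results, even though it doesn't use sed.

This solution doesn't work in the Mac OS X shell (Sierra).

@RiccardoDonato: Are you using gawk (GNU AWK)? FPAT is gawk-specific.

@steve, sorry, you're right! I was using awk. I installed gawk and now it works fine.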
gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"