Splitting a comma-separated list with line numbers using awk and bash

shell awk

I have a (very large) CSV file in the following format:

id;surname;firstname;aliases
1;Simpson;Homer;Homer Jay Simpson,Homer J. Simpson
2;Simpson;Bart;Bartholomew JoJo Simpson,Bartholomew Simpson
3;Krusty the Clown;;Herschel Shmoikel Pinchas Yerucham Krustofsky
4;Simpson;Lisa;

Now I want to convert it to the following format:

id;name
1;Homer Simpson
1_1;Homer Jay Simpson
1_2;Homer J. Simpson
2;Bart Simpson
2_1;Bartholomew JoJo Simpson
2_2;Bartholomew Simpson
3;Krusty the Clown
3_1;Herschel Shmoikel Pinchas Yerucham Krustofsky
4;Lisa Simpson

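For performance reasons, I would like to do this with awk or other UNIX command-line tools.

With

awk -F';' '{print $1, $3, $2}'

I can split the semicolon-separated lines. But how do I split the comma-separated entries again from within awk?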

Please try the following (written and tested with the shown samples). The code is commented line by line as an explanation.

awk '
BEGIN {
  FS = "[;,]"                # split every line on either a semicolon or a comma
  OFS = ";"                  # join output fields with a semicolon
  print "id;name"            # print the new header before reading Input_file
}
FNR > 1 {                    # skip the original header line
  j = ($2 ~ / / ? 2 : 3)     # start at field 2 if the surname contains a space, else at field 3
                             # (this assumes such rows have no first name, as in the sample)
  c = 0                      # alias counter, reset for every line
  for (i = j; i <= NF; i++) {
    if ($i == "") {          # ignore empty fields (missing first name, no aliases)
      continue
    }
    if (i == j) {            # starting field: print the id and "firstname surname"
      print $1, ($3 != "" ? $3 " " : "") $2   # no leading space when the first name is empty
    }
    else {                   # every later non-empty field is an alias
      print $1 "_" (++c), $i # id, underscore, running alias number, then the alias
    }
  }
}
' Input_file

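Awk has a split function that can split a string into an array:

awk -F';' 'BEGIN { OFS = FS }
{
  print $1, $3 " " $2
  n = split($4, aliases, /,/)
  # The source is cut off after the loop header; the loop body and the
  # file name below are reconstructed from the desired output.
  for (i = 1; i <= n; i++)
    print $1 "_" i, aliases[i]
}' Input_file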
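This does what you intend in Python 3. Note that I typed it quickly and many improvements are possible. I believe it may be faster than awk, but I may be wrong. You can check whether that is the case with the time command on Linux and Mac.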
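#!/usr/local/bin/python3

import csv

# newline="" is the file mode the csv module expects.
with open("simpsons.csv", newline="") as f:
    reader = csv.reader(f, delimiter=";")
    next(reader, None)       # skip the input header row
    print("id;name")         # header of the desired output
    for row in reader:
        # First name, then surname; join only the non-empty parts so that
        # a row without a first name gets no leading space.
        name = " ".join(part for part in (row[2], row[1]) if part)
        print("{};{}".format(row[0], name))
        # Emit each non-empty alias as <id>_<n>, numbered from 1.
        aliases = [a for a in row[3].split(",") if a]
        for n, alias in enumerate(aliases, start=1):
            print("{}_{};{}".format(row[0], n, alias))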

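Hope this helps.
Edit:
I generated a dummy list with 500k rows and tested some of the answers given by users here. There seems to be no significant difference between Python 3 and awk (at least with my poor Python 3 implementation).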
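
The dummy file used for that test is not shown. A minimal sketch of how such a file could be generated, assuming the same four-column layout as in the question; the function names make_dummy_rows and write_dummy_csv and the placeholder names are hypothetical, not taken from the answer:

```python
import csv
import random


def make_dummy_rows(n, seed=0):
    """Yield n rows shaped like the question's input: id, surname, firstname, aliases."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same rows
    surnames = ["Simpson", "Flanders", "Burns"]   # placeholder data
    firstnames = ["Homer", "Bart", "Lisa", ""]    # "" mimics a missing first name
    for i in range(1, n + 1):
        surname = rng.choice(surnames)
        firstname = rng.choice(firstnames)
        # 0 to 2 aliases per row, joined with commas inside the last column
        alias_count = rng.randint(0, 2)
        aliases = ",".join(
            "Alias{}_{}".format(i, k) for k in range(1, alias_count + 1)
        )
        yield [str(i), surname, firstname, aliases]


def write_dummy_csv(path, n):
    """Write a header plus n dummy rows as a semicolon-delimited file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=";", lineterminator="\n")
        writer.writerow(["id", "surname", "firstname", "aliases"])
        writer.writerows(make_dummy_rows(n))
```

Calling write_dummy_csv with a row count of 500000 would give a file of the size mentioned in the edit.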
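$ cat tst.awk
BEGIN { FS = OFS = ";" }
NR == 1 {
    print $1, "name"
    next
}
{
    name = $3 " " $2
    gsub(/^ +| +$/, "", name)      # trim the space left over when a name part is empty
    print $1, name
    n = split($NF, aliases, /,/)
    # The source is cut off after the loop header; the loop body is
    # reconstructed from the desired output.
    for (i = 1; i <= n; i++) {
        print $1 "_" i, aliases[i]
    }
}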
Krusty the Clown is missing; is that a bug?
@tripleee, thank you sir, I have fixed it now.
Try ($3 ? $3 " " : "") if you want to get rid of the space in front of Krusty.
You have included a case with a surname but no first name (Krusty the Clown). If the opposite can also occur, you should include that in your example too.
For your timing results: is that the third-run timing of each script, to remove the effects of caching?
@EdMorton I actually ran each script only once. Is there caching in this context?
Yes. Due to caching, subsequent runs can be faster than the initial run, so you must always run any command 3 times and take the timing of the 3rd execution to compare with the timing of any other command (which you would also run 3 times).
Awesome. I did not know there was a process-level cache. Is it related to the CPU cache? Do you have any links on this?
I don't know what it is related to. I just searched for a reference but could not find one, sorry. I might try searching for a few more minutes, but it is a hard thing to query because it generates many irrelevant hits. There is a mention of the effect, FWIW.
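
The measurement procedure described in the comments (run each command three times and compare only the third timing) can be sketched as follows. This is an illustration under assumptions: the helper name time_third_run is hypothetical, and it measures wall-clock time of the whole process rather than reproducing the output of the time command.

```python
import subprocess
import time


def time_third_run(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of the last run.

    The earlier runs only warm up whatever caching is in effect, as
    suggested in the comments; their timings are discarded.
    """
    elapsed = None
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so that terminal speed does not affect the timing.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        elapsed = time.perf_counter() - start
    return elapsed
```

For example, time_third_run(["awk", "-f", "tst.awk", "simpsons.csv"]) could be compared with the same call for the Python script, provided both are measured on the same input file.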