Regex: sed/awk processing of a heavy 3GB csv database

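I have been put in charge of managing some old LTO tape databases, and I thought it would be a good opportunity to build a functional library while learning some bash scripting and text processing. The csv database is about 30 million lines long, roughly 3GB in total. I have become quite effective at locating lines with grep and regex, but now I want to reformat the whole csv file with sed/awk so that it can be processed faster. This has turned out to be much harder than I expected, and I am hoping some experts can point me in the right direction. The csv database is formatted like this: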

<START OF FILE>
AE19T1JA47 -

File Name,Directory Name,Size of File,Time Last Modified

Trash,,0,2013-12-20 13:38:04
RAW FOOTAGE,,0,2013-12-20 13:39:00
DAEDALUS - ARCHIVE - 122013,,0,2013-12-20 13:40:00
STAR_HAFFLEN_PORTER_ROBINSON,DAED3 - ARCHIVE - 122013,0,2013-12-20 13:40:00
STAR_JAPAN_SETTING_SUN_092413,DAED3 - ARCHIVE - 122013,0,2013-12-20 13:40:00
STAR_YTMA_090713,DAED3 - ARCHIVE - 122013,0,2013-12-20 13:40:00
Audio,DAED3 - ARCHIVE - 122013/STAR_BILLYB_PORTER_ROBINSON,0,2013-09-03 11:21:00
Footage,DAED3 - ARCHIVE - 122013/STAR_BILLYB_PORTER_ROBINSON,0,2013-12-20 13:40:00
FWN_ASPERA_TEST_FTG,RAW FOOTAGE,0,2013-12-20 13:40:00
LANA_BRISK_REWSTO_WEEKEND_CASH_121813_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
LANA_STAR_WORLD_TURNT_LOST_WORLDS_121713_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
CZECH_PILOTS_ARCHIVAL,RAW FOOTAGE,0,2013-12-20 13:40:00
STAR_CAND_ELVY_121713_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
STAR_NEWS_PROMOS_PETE_122013_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
STAR_PODCAST_STEVE_Q_NG_121913_RAW,RAW FOOTAGE,0,2013-12-20 13:40:00
A242_C035_0101MR.RDC,RAW FOOTAGE/FWN_ASPERA_TEST_FTG,0,2013-12-20 13:40:00
md5,RAW FOOTAGE/FWN_FTP_TEST_FTG/A242_C035_0101MR.RDC,0,2013-08-30 08:19:00
MVI_9292.THM,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,18687,2013-12-13 17:16:00
._MVI_9293.MOV,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,4096,2013-12-20 14:43:00
MVI_9286.THM,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,11570,2013-12-13 17:06:00
._MVI_9294.THM,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,4096,2013-12-20 14:43:00
MVI_9286.MOV,RAW FOOTAGE/STAR_CRANK_ELVY_PROMO_121613_RAW/STAR_CRANK_ELVY_BONES_PROMO_121613_A_01/DCIM/100EOS7D,387269573,2013-12-13 17:06:00
._.DS_Store,,4096,2013-12-21 16:01:00
.DS_Store,,6148,2013-12-21 16:01:00


AE19T1ML3W -

File Name,Directory Name,Size of File,Time Last Modified

Trash,,0,2013-12-21 16:21:39
DRIVE BACKUPS,,0,2013-12-21 16:27:00
STAR_00112_500GB_BOMBU_REELS,DRIVE BACKUPS,0,2013-12-21 16:27:00
STAR_LANACannesGabby_00106,DRIVE BACKUPS,0,2013-12-21 16:26:00
STAR_01113_1TB_southy_Freeski,DRIVE BACKUPS,0,2013-12-21 16:27:00
STAR 1 TB 31,DRIVE BACKUPS,0,2013-12-21 16:27:00
Media,DRIVE BACKUPS/STAR_00112_500GB_BOMBU_REELS,0,2013-12-21 16:27:00
V_BOMBU_ALLVERSIONS_20131121,DRIVE BACKUPS/STAR_00112_500GB_BOMBU_REELS/Media,0,2013-12-21 16:27:00
tabsz_LOREAL_DELIVERY_082213,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
LANA_SIZZLE_REEL_082213,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
43_STAR SWSW,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
2013-03-16.bbr,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
ADDITIONAL_tabsz_FILES,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
Autosave Vault,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00
WADU_SATURDAY,DRIVE BACKUPS/STAR_LANACannesGabby_00106,0,2013-12-21 16:27:00


<END OF FILE>

Broken down structurally, each csv database looks like this:

<START OF FILE>
<LTO TAPE NAME><SPACE><DASH>
<NEWLINE>
<TOC LEGEND>
<NEWLINE>
<CONTENTS OF TAPE ABOVE>
<NEWLINE>
<NEWLINE>
<NAME OF NEXT LTO TAPE><SPACE><DASH>
<NEWLINE>
<TOC LEGEND>
<NEWLINE>
<CONTENTS OF TAPE ABOVE>
<NEWLINE>
<NEWLINE>
<END OF FILE>

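I would like to condense the whole database by taking each LTO tape name and appending it to the end of every content line, separated by a comma, so that I can more easily see which tape each file is on. Basically, I want to take the structure above and reformat it as: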

<START OF FILE>
<TOC LEGEND>
<CONTENTS OF TAPE>,<RESPECTIVE TAPE NAME>
<CONTENTS OF TAPE...>,<RESPECTIVE TAPE NAME>
...
<END OF FILE>

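Reduce your example to a minimal one so that we can help you. See the help on that if it is still unclear; it is both inspiring and practical.

If I understand correctly what you are trying to do, this might work. It uses a regex to look for the name of the tape. When a line matches the regex, it splits that line on the space to get the name. For any other line that has 4 fields and whose 4th field is not "Time Last Modified", it prints the line with the first element of the name array appended at the end.
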
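awk -F, '
  # Tape header such as "AE19T1JA47 -": remember the tape name.
  /^[A-Z0-9]* -$/ { split($1, name, " "); next }
  # Content row: exactly 4 fields, and not the "File Name,..." legend.
  NF == 4 && $4 != "Time Last Modified" { print $0 "," name[1] }
' tape.txt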
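
As a complement to the command above, here is a minimal sketch that also prints the legend once at the top, which is what the target layout in the question asks for. The function name flatten_tapes, the extra "Tape Name" column label and the CRLF handling are assumptions made for illustration, not part of the original answer. It also assumes that no file or directory name contains a comma.

```shell
#!/bin/sh
# flatten_tapes FILE
# Hypothetical helper: prints the legend once, then every content row
# with the name of its tape appended as a fifth comma-separated field.
flatten_tapes() {
  awk -F, '
    # Assumption: the export may use CRLF line endings, so drop a trailing CR.
    { sub(/\r$/, "") }

    # Tape header such as "AE19T1JA47 -": remember the name without " -".
    /^[A-Z0-9]+ -$/ { tape = $1; sub(/ -$/, "", tape); next }

    # Legend line: keep only the first one and label the new column.
    $4 == "Time Last Modified" { if (!seen++) print $0 ",Tape Name"; next }

    # Content row: exactly 4 fields (blank lines have none and are skipped).
    NF == 4 { print $0 "," tape }
  ' "$1"
}
```

It could be run as flatten_tapes tape.txt > flat.csv, where flat.csv is a hypothetical output name. Because awk reads the input one line at a time, memory use should not grow with the size of the file.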