How can I split a file by matching line context in the shell?


I have a file, x, with section separators:

The first section

#!

The second section

#!

The third section

#!

The second section

#!

I'd like to split it into a sequence of separate files, such as:

The first section

I thought csplit would be a solution, with a command line like the following:

$ csplit -sk x '/#!/' {9999}

But the second file (xx01) ends up containing two separators:

The first section

#!

The second section

#!

The third section

#!

The second section

#!

Any ideas on how to do what I want in a POSIX-compliant way? (Yes, I have access to Perl/Python/Ruby and friends; the point, though, is to extend my shell knowledge.)

I'm worried that I've found a bug in the OSX csplit. Could people try the following and let me know the results?

#!/bin/sh

test -e

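# Scratch directory named after the script. Note that $RANDOM is not POSIX;
# a plain sh such as dash expands it to an empty string.
work="$(basename "$0").$RANDOM"
mkdir "$work"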

csplit -sk -f "$work/" - '/#/' '{9999}' <<EOF
First
#
Second
#
Third
EOF

if [ $(grep -c '#' $work/01) -eq 2 ]; then
  echo FAIL Repeat
else
  echo PASS Repeat
fi

rm $work/*

csplit -sk -f "$work/" - '/#/' '/#/' <<EOF
First
#
Second
#
Third
EOF

if [ $(grep -c '#' $work/01) -eq 2 ]; then
  echo FAIL Exact
else
  echo PASS Exact
fi

uname -a

On my Debian box, I get:

$ ./csplit-test
csplit: #: no match
FAIL Repeat
PASS Exact
Darwin lani.bigpond 11.2.0 Darwin Kernel Version 11.2.0: Tue Aug  9 20:54:00 PDT 2011; root:xnu-1699.24.8~1/RELEASE_X86_64 x86_64

$ sh ./csplit-test 
csplit: `/#/': match not found on repetition 2
PASS Repeat
PASS Exact

On Linux, this seems to work for me:

csplit -sk filename '/#!/' {*}

which gives:

$ more xx00
The first section

$ more xx01
#!

The second section

$ more xx02
#!

The third section

You could also do this in a small Ruby or Perl script and strip the separators at the same time.
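Stripping the separators doesn't strictly need Ruby or Perl. Here is a minimal sketch in plain POSIX sh, assuming a separator is a line consisting of exactly #!. The sh_split_demo directory and the part0, part1, ... file names are made up for the illustration.

```shell
#!/bin/sh
# Hypothetical demo directory and sample input (not from the question).
demo=sh_split_demo
mkdir -p "$demo"
cat > "$demo/x" <<'EOF'
The first section

#!

The second section

#!

The third section
EOF

n=0
out="$demo/part$n"
: > "$out"                        # create (or truncate) the first piece

# Read line by line; IFS= and -r keep whitespace and backslashes intact.
while IFS= read -r line; do
    if [ "$line" = '#!' ]; then
        n=$((n + 1))              # separator: switch to the next piece
        out="$demo/part$n"
        : > "$out"
        continue                  # and drop the separator line itself
    fi
    printf '%s\n' "$line" >> "$out"
done < "$demo/x"

ls "$demo"
```

A while/read loop is slow on large inputs, so treat this as a learning exercise rather than something for big files.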

On Fedora 13 Linux:

$ ./test.sh 
csplit: `/#/': match not found on repetition 2
PASS Repeat
PASS Exact
Linux localhost.localdomain 2.6.34.8-68.fc13.x86_64 #1 SMP Thu Feb 17 15:03:58 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Oh. (FreeBSD 8.1 install running in a Parallels virtual machine.)

$ ./test_split.sh
csplit: #: no match
FAIL Repeat
PASS Exact
FreeBSD 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:55:53 UTC 2010 root@almeida.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC i386
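In the results reported in this thread, the Repeat case (a count far larger than the number of separators) is the one that differs between implementations, while the Exact case passes everywhere. One possible workaround is to count the separators first and give csplit exactly that many repetitions. This is only a sketch: the csplit_count_demo directory is a made-up name, and it is untested on the BSD-derived csplit versions discussed here.

```shell
#!/bin/sh
# Hypothetical demo directory and sample input (not from the question).
demo=csplit_count_demo
mkdir -p "$demo"
cat > "$demo/x" <<'EOF'
The first section

#!

The second section

#!

The third section
EOF

# Count the separator lines up front.
count=$(grep -c '^#!' "$demo/x")

if [ "$count" -gt 1 ]; then
    # The pattern is applied once, then repeated count-1 more times.
    csplit -s -f "$demo/xx" "$demo/x" '/^#!/' "{$((count - 1))}"
elif [ "$count" -eq 1 ]; then
    csplit -s -f "$demo/xx" "$demo/x" '/^#!/'
else
    cp "$demo/x" "$demo/xx00"     # no separators: a single piece
fi

ls "$demo"
```

Because the repeat count never overshoots, csplit has no failed match to report, so -k is not needed to keep the pieces.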

While not ideal, you can do something similar with awk.

Your file:

[jaypal:~/Temp] cat f0
The first section

#!

The second section

#!

The third section

[jaypal:~/Temp] cat temp1 
The first section

[jaypal:~/Temp] cat temp2 
#!

The second section

[jaypal:~/Temp] cat temp3 
#!

The third section

To get everything before the first separator, use this (you can redirect the output to a file):
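A minimal sketch of this first step, assuming (this is my guess, not necessarily the command the answer had in mind) that awk simply stops at the first separator line. The awk_head_demo directory and first.txt are made-up names.

```shell
#!/bin/sh
# Hypothetical demo directory and sample input.
demo=awk_head_demo
mkdir -p "$demo"
cat > "$demo/f0" <<'EOF'
The first section

#!

The second section

#!

The third section
EOF

# Print lines until the first separator, then stop reading the file.
awk '/^#!/ { exit } { print }' "$demo/f0" > "$demo/first.txt"

cat "$demo/first.txt"
```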
To get each #! together with the content that follows it, splitting before the next #!:
[jaypal:~/Temp] awk '/^#!/{x++}{print >(x".txt")}' f0
[jaypal:~/Temp] ls *.txt
1.txt 2.txt
[jaypal:~/Temp] cat 1.txt 
#!

The second section

[jaypal:~/Temp] cat 2.txt 
#!

The third section
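One thing to watch in the one-liner above: x is unset until the first separator is seen, so the first section is written to a file named .txt, which ls *.txt does not list. A small variation, sketched here with made-up names (awk_split_demo, part1.txt, ...), starts the counter at 1 and closes each file as it goes:

```shell
#!/bin/sh
# Hypothetical demo directory and sample input.
demo=awk_split_demo
mkdir -p "$demo"
cat > "$demo/f0" <<'EOF'
The first section

#!

The second section

#!

The third section
EOF

# n starts at 1, so text before the first "#!" lands in part1.txt.
# Each separator closes the current file and starts the next one;
# the separator line itself becomes the first line of the new file.
awk -v prefix="$demo/part" '
    BEGIN { n = 1; out = prefix n ".txt" }
    /^#!/ { close(out); n++; out = prefix n ".txt" }
          { print > out }
' "$demo/f0"

ls "$demo"
```

Closing each file matters on inputs with many sections, since some awk implementations limit the number of files open at once.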

With perl, you might get away with something simple like this:
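#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole input (file arguments or stdin) into one string.
local $/;
my $text = <>;
my $n = 0;

# Split before each "#!"; the zero-width lookahead keeps the separator
# at the start of the section that follows it.
for my $section (split /(?=#!)/, $text) {
      my $name = 'temp' . ++$n;
      open(my $out, '>', $name) or die "cannot write $name: $!";
      print {$out} $section;
      close($out);
}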

Testing with your exact setup gives the result you want. I'm using csplit (GNU coreutils) 8.5.

That's too bad. I'm using the OSX csplit. I just added a test. Could you run it and tell me the results?

Using awk, tested on a Linux machine.

My awk version:

$ awk --version | head -1
GNU Awk 4.0.0

Contents of infile:

$ cat infile
The first section

#!

The second section

#!

The third section

Contents of the awk script:

$ cat script.awk
BEGIN {
        ## Set 'Input Record Separator' variable.
        ## A multi-character RS works in GNU awk; POSIX leaves it unspecified.
        RS = "#!";
}

{
        ## Set an integer variable as output file name.
        ++filenum;
}

## For first section.
FNR == 1 {
        ## Remove leading and trailing spaces (\s is a GNU awk operator).
        sub( /^\s+/, "", $0);
        sub( /\s+$/, "", $0);

        ## Print to output file.
        printf "%s\n", $0 > filenum ".txt"
}

## For sections from second one to last one.
FNR > 1 {
        ## Remove trailing spaces.
        sub( /\s+$/, "", $0);

        ## Print to output file, putting the separator back in front.
        printf "%s%s\n", RS, $0 > filenum ".txt"
}

Run the script:

$ awk -f script.awk infile

Check the output:

$ ls [0-9].txt
1.txt  2.txt  3.txt
$ cat 1.txt 
The first section
$ cat 2.txt 
#!

The second section
$ cat 3.txt 
#!

The third section
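The script above relies on a multi-character RS and on \s, both of which go beyond what POSIX awk guarantees. Since the question asks for something POSIX-compatible, here is a line-based sketch that aims for the same output (separator kept, surrounding blank lines trimmed). It assumes a separator is a line consisting of exactly #!, and the awk_trim_demo directory is a made-up name.

```shell
#!/bin/sh
# Hypothetical demo directory and sample input.
demo=awk_trim_demo
mkdir -p "$demo"
cat > "$demo/infile" <<'EOF'
The first section

#!

The second section

#!

The third section
EOF

awk -v prefix="$demo/" '
    # Begin a new output file and forget any held-back blank lines.
    function start_section() {
        out = prefix n ".txt"
        pending = 0
        started = 0
    }

    BEGIN { n = 1; start_section() }

    # Separator line: finish the current file, open the next one,
    # and keep the separator as its first line.
    /^#!$/ {
        close(out)
        n++
        start_section()
        print > out
        started = 1
        next
    }

    # Blank line: hold it back until we know more text follows.
    /^[ \t]*$/ { pending++; next }

    {
        if (!started) pending = 0      # drop leading blank lines
        while (pending > 0) { print "" > out; pending-- }
        print > out
        started = 1
    }
' "$demo/infile"

ls "$demo"
```

Blank lines at the end of a section are never flushed, because the held-back count is reset when the next separator (or end of input) arrives.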