How to split a large text file into equal-sized parts in Perl without truncating records?

perl unix

I have a large text file (about 10 GB) containing a large number of stories. Each story begins with the marker $$. Here is a sample of the file:

$$
AA This is story 1
BB 345

$$

AA This is story 2
BB 456

I want to split this file into parts of roughly 250 MB each, but no story should be split across two files.

Can anyone help me with Unix or Perl code for this?

csplit is exactly what you want. It works the same way as split, but based on a pattern.

An alternative in C++ (untested); only the opening of the code survives here:

#include ...
void new_output_file(boost::shared_ptr<...> &out, const char *prefix)
{
    static int i = 0;
    std::ostringstream filename;
    filename ...
I modified the answer's code and found that it works. If you see ways to improve it, please suggest changes:
use strict;
use warnings;

my $targetsize = 50*1024*1024;
my $fileprefix = 'chunk';
my $outfile = 0;
my $outsize = 0;
my $outfh;
my $temp='';
while (my $line = <>)  {
  chomp($line);
  next unless $line;
  # discard initial empty chunk  
  if($line =~ /^\$\$$/ || $outfile == 0){
        $outsize += length($temp);
        if ( $outfile == 0 || ($outsize - $targetsize) > 0)  { 
              ++$outfile; 
              if($outfh) {close($outfh);}
              open $outfh, '>', "$fileprefix$outfile"; 
              $outsize = 0;
        }
        $temp='';
    }
  $temp = $temp.$line;
  print $outfh "$line\n";  
} 

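Comments:

That doesn't seem to do what the question asks at all.

Sure it does. You can supply a regular expression as the split criterion. If the asker sets it to /\$\$/, csplit should do what they want.

@CanSplice: but the goal isn't to split on a regex; it's to split every 250 MB without splitting the \$\$\n chunks. csplit won't do that.

Thanks for the code, but unfortunately it throws some compile-time errors. I'm happy to compile C++ code but not keen on debugging it [:P]. Could you check it for me?

I haven't tested this code yet, but do you think reading a file as large as 10 GB is feasible here? Won't the program fail with a "system out of memory" error?

Thanks. I was just wondering what autodie is for here? I commented that line out because it isn't installed on my system, but now the program writes the whole input file into chunk1.

@Man: autodie just makes the program die if the open fails, so no code is needed to check for failure explicitly; commenting it out shouldn't cause a problem. Is the whole input file more than 250 MB? Is the separator between stories something other than two dollar signs immediately followed by a newline?

Thanks for the explanation. I think the record separator is the problem. When I run cat -vet test.txt | more, I get something like: $$^M$story1$$^M$story2…

open automatically closes the filehandle first if it is already open, though the explicit close does no harm. You only use $temp for its length; you could add to $outsize for each line instead. You are removing blank lines within stories; is that intentional? The lengths you add don't include the newlines, so the total comes out slightly short; you could drop the chomp and the \n in the print, and use next if $line eq "\n" instead.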
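The suggestions in the last comment (count every line toward the running size, include the line terminator in the length, keep blank lines inside stories) could be sketched roughly as follows. This is only an illustration, not the asker's or the answerer's code: the name `split_stories` and its parameters are hypothetical, and the sketch assumes a story marker is a line containing only `$$`, ending in either LF or CRLF. Like the modified code above, it starts a new file at the first marker after the current file has reached the target size, so files can run slightly over the target.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the lines read from $in into files named
# "${prefix}1", "${prefix}2", ... and start a new file only on a "$$"
# marker line once the current file has reached $target bytes.
# Returns the number of files written.
sub split_stories {
    my ($in, $prefix, $target) = @_;
    my ($count, $size, $out) = (0, 0);
    while (my $line = <$in>) {
        # A marker is "$$" alone on a line, ending in LF or CRLF.
        my $is_marker = $line =~ /^\$\$\r?\n?\z/;
        # Open the first file unconditionally; later files only at a marker.
        if (!$count || ($is_marker && $size >= $target)) {
            if ($out) { close $out or die "close failed: $!"; }
            ++$count;
            open $out, '>', "$prefix$count"
                or die "cannot open $prefix$count: $!";
            $size = 0;
        }
        $size += length $line;    # length includes the line terminator
        print {$out} $line;       # blank lines are written unchanged
    }
    if ($out) { close $out or die "close failed: $!"; }
    return $count;
}

# Run as a script: perl split_stories.pl INPUT_FILE
if (!caller() && @ARGV) {
    open my $in, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!";
    my $n = split_stories($in, 'chunk', 250 * 1024 * 1024);
    print "wrote $n files\n";
}
```

Because the input is read one line at a time, memory use stays small regardless of the file size, and explicit `or die` checks stand in for autodie on systems where it is not installed.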
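The Perl answer's code, which reads one story per record:

use strict;
use warnings;
use autodie;    # makes a failed open die without an explicit check

# Read one story at a time: each record ends at "$$" followed by a newline.
$/ = "\$\$\n";
my $targetsize = 250*1024*1024;
my $fileprefix = 'chunk';
my $outfile = 0;
my $outfh;
my $outsize = 0;
while (my $story = <>) {
    chomp($story);    # removes the trailing "$$\n" separator
    next unless $story; # disregard initial empty chunk
    $story = "$/$story";    # put the marker back at the start of the story

    # no file open yet, or this story takes us farther from the target size
    if ( ! $outfile || abs($outsize - $targetsize) < abs($outsize + length($story) - $targetsize) ) {
        ++$outfile;
        open $outfh, '>', "$fileprefix$outfile";    # reopening closes the previous file
        $outsize = 0;
    }

    $outsize += length($story);
    print $outfh $story;
}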
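With `$/` set to the story separator, Perl holds only one story in memory at a time, so a 10 GB input is not a problem in itself. That only holds if the separator actually occurs in the file: as the `cat -vet` output in the comments shows (`$$^M$`), a file with CRLF line endings never contains `"$$\n"`, so the whole input is read as a single record and lands in chunk1. A minimal sketch of one way to pick the separator from the file's first line ending is shown below; `story_separator` is a hypothetical helper name, and the sketch assumes the file uses one line-ending style throughout.

```perl
use strict;
use warnings;

# Hypothetical helper: return the record separator for a file whose stories
# start with "$$" on a line of its own, using CRLF if the first line ends
# in CRLF and LF otherwise.
sub story_separator {
    my ($path) = @_;
    # :raw keeps any "\r" in the data so it can be detected.
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $first = <$fh>;
    close $fh;
    return (defined $first && $first =~ /\r\n\z/) ? "\$\$\r\n" : "\$\$\n";
}

# Run as a script: perl story_separator.pl INPUT_FILE
if (!caller() && @ARGV) {
    my $sep = story_separator($ARGV[0]);
    printf "separator is \$\$ followed by %s\n", $sep =~ /\r/ ? 'CRLF' : 'LF';
}
```

In the record-at-a-time loop, the result would replace the hard-coded value, for example `$/ = story_separator($file);` before reading from that file.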
