如何使用PowerShell拆分文本文件?

如何使用PowerShell拆分文本文件?,powershell,Powershell,我需要将一个大的(500MB)文本文件(一个log4net异常文件)拆分成可管理的块,比如1005MB文件就可以了 我想这应该是一次在公园里散步的机会。如何执行此操作?对于PowerShell来说,这是一项比较简单的任务,因为标准的Get-Content cmdlet不能很好地处理非常大的文件,这一事实使任务变得复杂。我建议您使用.NET在PowerShell脚本中逐行读取文件,并使用Add Contentcmdlet将每一行写入文件名中索引不断增加的文件。大概是这样的: $upperBound

我需要将一个大的(500MB)文本文件(一个log4net异常文件)拆分成可管理的块,比如1005MB文件就可以了


我想这应该是一次在公园里散步的机会。如何执行此操作?

对于PowerShell来说,这是一项比较简单的任务,因为标准的Get-Content cmdlet不能很好地处理非常大的文件,这一事实使任务变得复杂。我建议您使用.NET在PowerShell脚本中逐行读取文件,并使用
Add Content
cmdlet将每一行写入文件名中索引不断增加的文件。大概是这样的:

$upperBound = 50MB # calculated by Powershell
$ext = "log"
$rootName = "log_"

$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    if((Get-ChildItem -path $fileName).Length -ge $upperBound)
    {
        ++$count
        $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    }
}

$reader.Close()

我经常需要做同样的事情。诀窍是将标题重复到每个分割的块中。我编写了以下cmdlet(PowerShell v2 CTP 3),它实现了这一点

##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK 
# Out-TempFile
##############################################################################
function Split-File {

    [CmdletBinding(DefaultParameterSetName='Path')]
    param(

        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('c')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Count,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount

    )

    process {

        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{

            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt  = [IO.Path]::GetExtension($_)

            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # get the input file in manageable chunks

            $Part = 1
            Get-Content $_ -ReadCount:$Count | %{

                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }

                # write this chunk to the output file
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $_

                $Part += 1

            }

        }

    }

}

我根据每个部分的大小对分割文件做了一些修改

##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Size
# (Or -s) The maximum size of each file. Size must be expressed in MB.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv -s 20 -rc 1
#
#.LINK 
# Out-TempFile
##############################################################################
function Split-File {

    [CmdletBinding(DefaultParameterSetName='Path')]
    param(

        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('s')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Size,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount

    )

    process {

  # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{

            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt  = [IO.Path]::GetExtension($_)

            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

   Resolve-Path @ResolveArgs | %{

    $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
    $InputExt  = [IO.Path]::GetExtension($_)

    if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

    # get the input file in manageable chunks

    $Part = 1
    $buffer = ""
    Get-Content $_ -ReadCount:1 | %{

     # make an output filename with a suffix
     $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

     # In the first iteration the header will be
     # copied to the output file as usual
     # on subsequent iterations we have to do it
     if ($RepeatCount -and $Part -gt 1) {
      Set-Content $OutputFile $Header
     }

     # test buffer size and dump data only if buffer is greater than size
     if ($buffer.length -gt ($Size * 1MB)) {
      # write this chunk to the output file
      Write-Host "Writing $OutputFile"
      Add-Content $OutputFile $buffer
      $Part += 1
      $buffer = ""
     } else {
      $buffer += $_ + "`r"
     }
    }
   }
        }
    }
}

我在尝试将单个vCard VCF文件中的多个联系人拆分为单独的文件时发现了这个问题。这是我根据李的代码所做的。我必须查看如何创建新的StreamReader对象,并将null更改为$null

$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count) 

while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line

    if($line -eq "END:VCARD")
    {
        ++$count
        $filename = "C:\Contacts\{0}.vcf" -f ($count)
    }
}

$reader.Close()

关于一些现有答案的一点警告——对于非常大的文件,它们运行速度会非常慢。对于一个1.6GB的日志文件,我在几个小时后就放弃了,因为我意识到在第二天返回工作之前它不会完成

两个问题:调用打开、查找并关闭源文件中每一行的当前目标文件。每次读一点源文件并寻找新行也会减慢速度,但我猜添加内容是主要原因

下面的变体产生稍微不愉快的输出:它将在中间行分割文件,但是它在不到一分钟内分裂了1.6GB日志:

$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB


$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
            $toFile = [io.file]::OpenWrite($to)
            try {
                "Writing $count to $to"
                $tofile.Write($buff, 0, $count)
            } finally {
                $tofile.Close()
            }
        }
        $idx ++
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}
还有一种快速(而且有些脏)的单衬里:

$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }
    $linecount=0; $i=0; 
    Get-Content .\BIG_LOG_FILE.txt | %
    { 
      Add-Content OUT$i.log "$_"; 
      $linecount++; 
      if ($linecount -eq 3000) {$I++; $linecount=0 } 
    }
您可以通过更改硬编码的3000值来调整每批的第一行数。

执行以下操作:

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII
文件1

还有一种快速(而且有些脏)的单衬里:

$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }
    $linecount=0; $i=0; 
    Get-Content .\BIG_LOG_FILE.txt | %
    { 
      Add-Content OUT$i.log "$_"; 
      $linecount++; 
      if ($linecount -eq 3000) {$I++; $linecount=0 } 
    }
通过更改硬编码的3000值,可以调整每批的第一行数

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII
文件2

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII
文件3

Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII

等等。

根据行数(本例中为100行)简单拆分一行:


对于我的源文件来说,这些答案中的许多都太慢了。我的源文件是介于10MB和800MB之间的SQL文件,需要将它们拆分为大致相等行数的文件

我发现以前使用添加内容的一些答案相当慢。等待数小时以完成拆分并不罕见

我没有尝试,但它看起来只按文件大小进行拆分,而不是按行数进行拆分

以下内容符合我的目的

$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length

Write-Host "Total Lines :" $totalLines

$skip = 0
$count = 100000; # Number of lines per file

# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")

while ($skip -le $totalLines) {
    $upper = $skip + $count - 1
    if ($upper -gt ($lines.Length - 1)) {
        $upper = $lines.Length - 1
    }

    # Write the lines
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])

    # Increment counters
    $skip += $count
    $fileNumber++
    $fileNumberString = $filenumber.ToString("000")
}

$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
对于一个54MB的文件,我得到了输出

Reading source file...
Total Lines : 910030
Split complete in  1.7056578 seconds

我希望其他寻找符合我要求的简单、基于行的拆分脚本的人会发现这很有用。

与这里的所有答案相同,但使用StreamReader/StreamWriter在新行上拆分(逐行拆分,而不是一次将整个文件读入内存)。这种方法可以以我所知道的最快的方式分割大文件

注意:我很少做错误检查,所以我不能保证它能顺利地为您的案例工作。它为我做了(1.7GB的TXT文件,400万行,在95秒内分割成100000行)

#拆分测试
$sw=新对象系统.Diagnostics.Stopwatch
$sw.Start()
$filename=“C:\Users\Vincent\Desktop\test.txt”
$rootName=“C:\Users\Vincent\Desktop\result”
$ext=“.txt”
$linesperFile=100000#100k
$filecount=1
$reader=$null
试一试{
$reader=[io.file]::OpenText($filename)
试一试{
“正在创建文件号$filecount”
$writer=[io.file]::CreateText({0}{1}.{2})-f($rootName,$filecount.ToString(“000”),$ext))
$filecount++
$linecount=0
而($reader.EndOfStream-ne$true){
“正在读取$linesperFile”
而($linecount-lt$linesperFile)-和($reader.EndOfStream-ne$true)){
$writer.WriteLine($reader.ReadLine());
$linecount++
}
if($reader.EndOfStream-ne$true){
“正在关闭文件”
$writer.Dispose();
“正在创建文件号$filecount”
$writer=[io.file]::CreateText({0}{1}.{2})-f($rootName,$filecount.ToString(“000”),$ext))
$filecount++
$linecount=0
}
}
}最后{
$writer.Dispose();
}
}最后{
$reader.Dispose();
}
$sw.Stop()
在“$sw.eassed.TotalSeconds”秒数内写入主机“拆分完成”
拆分1.7GB文件的输出:

...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in  95.6308289 seconds

我的要求有点不同。我经常使用逗号分隔和制表符分隔的ASCII文件,其中一行是一条数据记录。它们非常大,所以我需要将它们分成可管理的部分(同时保留标题行)

因此,我回到了我的经典VBScript方法,并拼凑了一个可以在任何Windows计算机上运行的小.vbs脚本(它由Windows上的WScript.exe脚本主机引擎自动执行)

这种方法的好处是它使用文本流,因此底层数据不会加载到内存中(或者至少不会一次加载所有数据)。其结果是,它的速度非常快,而且运行时不需要太多内存。我刚才在i7上使用此脚本拆分的测试文件大小约为1GB,有大约1200万行文本,并被拆分为25个部分文件(每个部分文件大约有500k行)-处理过程大约需要2分钟,并且在任何时候都没有超过3MB的内存

这里需要注意的是,它依赖于具有“行”的文本文件(即每个记录用CRLF分隔),因为文本流对象使用“ReadLine”函数一次处理一行。但是,如果你正在使用TSV或CSV文件,那就太完美了

Option Explicit

Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"  
Private Const REPEAT_HEADER_ROW = True                
Private Const LINES_PER_PART = 500000                 

Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

sStart = Now()

sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1

Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

If REPEAT_HEADER_ROW Then
    iLineCounter = 1
    sHeaderLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sHeaderLine)
End If

Do While Not oInputFile.AtEndOfStream
    sLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sLine)
    iLineCounter = iLineCounter + 1
    If iLineCounter Mod LINES_PER_PART = 0 Then
        iOutputFile = iOutputFile + 1
        Call oOutputFile.Close()
        Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
        If REPEAT_HEADER_ROW Then
            Call oOutputFile.WriteLine(sHeaderLine)
        End If
    End If
Loop

Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing

Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())

听起来像是UNIX的工作
$infile = "D:\Malcolm\Test\patch6.txt"
$path = "D:\Malcolm\Test\"
$lineCount = 1
$fileCount = 1

foreach ($computername in get-content $infile)
{
    write $computername | out-file -Append $path_$fileCount".txt"
    $lineCount++

    if ($lineCount -eq 1000)
    {
        $fileCount++
        $lineCount = 1
    }
}