Setting a custom line delimiter with a batch/PowerShell script

powershell batch-file

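I have a large file (over 1.5 GB) that uses "#@#@#" as its line delimiter. Before processing it through Informatica, I want to replace that delimiter with CRLF. The problem is that the file also contains CR and LF characters, which I need to remove before doing the replacement. I have found several ways to do this, but because of the file size I get out-of-memory exceptions.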

param
(
  [string]$Source,
  [string]$Destination
)

echo $Source
echo $Destination

$Writer = New-Object IO.StreamWriter $Destination
$Writer.Write( [String]::Join("", $(Get-Content $Source)) )
$Writer.Close()

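My question is whether there is a way to set my line delimiter to "#@#@#" and then read the file record by record to remove the CR and LF characters.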

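OK, my first attempt was far too slow. Here is a good solution that processed a 1.8 GB file in 2 minutes 48 seconds :-)

I used hybrid batch/JScript, so it runs on any Windows machine from XP onward, with no third-party exe files and no compilation required.

I read and write in ~1 MB chunks. The logic is actually quite simple.

I replace every \r\n with a single space and every #@#@# with \r\n. You can easily change the string values in the code to suit your needs.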

fixLines.bat

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::--- Batch section within JScript comment that calls the internal JScript ----
@echo off
setlocal disableDelayedExpansion

if "%~1" equ "" (
  echo Error: missing input argument
  exit /b 1
)
if "%~2" equ "" (
  set "out=%~f1.new"
) else (
  set "out=%~2"
)

<"%~1" >"%out%" cscript //nologo //E:JScript "%~f0"
if "%~2" equ "" move /y "%out%" "%~1" >nul

exit /b

----- End of JScript comment, beginning of normal JScript  ------------------*/
var delim='#@#@#',
    delimReplace='\r\n',
    nl='\r\n',
    nlReplace=' ',
    pos=0,
    str='';

var delimRegex=new RegExp(delim,"g"),
    nlRegex=new RegExp(nl,"g");

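while( !WScript.StdIn.AtEndOfStream ) {
  // Prepend the unprocessed tail of the previous chunk to the next ~1 MB read
  str=str.substring(pos)+WScript.StdIn.Read(1000000);
  pos=str.lastIndexOf(delim);
  if (pos>=0) {
    // Emit everything up to and including the last complete delimiter
    pos+=delim.length;
    WScript.StdOut.Write(str.substring(0,pos).replace(nlRegex,nlReplace).replace(delimRegex,delimReplace));
  } else {
    // No complete delimiter yet: keep the whole buffer and read more
    pos=0;
  }
}
// Flush any text that follows the final delimiter
if (str.length>pos) WScript.StdOut.Write(str.substring(pos).replace(nlRegex,nlReplace));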
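
The carry-over logic is easier to follow when it is separated from the WSH stream objects. The following is a minimal sketch in plain JavaScript (runnable in Node.js, not the JScript host used above). `transformChunks` and its parameters are hypothetical names introduced only for illustration, and it uses `split`/`join` instead of regular expressions. The input is passed as an array of already-read chunks, so a CRLF or a delimiter that straddles a chunk boundary can be exercised directly.

```javascript
// Sketch only: same idea as fixLines.bat, written as a pure function over
// an array of string chunks instead of WScript.StdIn / WScript.StdOut.
// transformChunks, chunks, and delim are illustrative assumptions.
function transformChunks(chunks, delim) {
  var out = [];
  var rest = '';                        // unprocessed tail carried to the next chunk
  for (var i = 0; i < chunks.length; i++) {
    var str = rest + chunks[i];
    var pos = str.lastIndexOf(delim);
    if (pos >= 0) {
      pos += delim.length;
      // Everything up to and including the last complete delimiter is safe to emit.
      out.push(
        str.substring(0, pos)
           .split('\r\n').join(' ')     // embedded CRLF -> single space
           .split(delim).join('\r\n')   // record delimiter -> CRLF
      );
      rest = str.substring(pos);
    } else {
      rest = str;                       // no complete delimiter yet; keep accumulating
    }
  }
  // Text after the final delimiter only needs its CRLFs replaced.
  if (rest.length > 0) out.push(rest.split('\r\n').join(' '));
  return out.join('');
}

// A CRLF and a delimiter are both split across chunk boundaries here.
var result = transformChunks(['a\r', '\nb#@#', '@#c\r\nd#@#@#'], '#@#@#');
console.log(JSON.stringify(result));    // prints "a b\r\nc d\r\n"
```

Because only the text up to the last complete delimiter is emitted on each pass, a record longer than one chunk simply accumulates in `rest` until its delimiter arrives.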

To overwrite the original file test.txt:

fixLines test.txt

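For fun, I tried processing the 1.8 GB file with JREPL.BAT. I did not expect it to work, because it must load the entire file into memory. No matter how much memory is installed in the computer, JScript has a maximum string size of 2 GB, and I think other limiting factors come into play as well.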

jrepl "\r?\n:#@#@#" " :\r\n" /m /x /t : /f input.txt /o output.txt

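The command failed after 5 minutes with an "Out of memory" error, and then my computer took a long time to recover from the severe memory abuse.

Below is my original custom batch/JScript solution, which reads and writes one character at a time.

slow.bat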

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::--- Batch section within JScript comment that calls the internal JScript ----
@echo off
setlocal disableDelayedExpansion

if "%~1" equ "" (
  echo Error: missing input argument
  exit /b 1
)
if "%~2" equ "" (
  set "out=%~f1.new"
) else (
  set "out=%~2"
)

<"%~1" >"%out%" cscript //nologo //E:JScript "%~f0"
if "%~2" equ "" move /y "%out%" "%~1" >nul

exit /b

----- End of JScript comment, beginning of normal JScript  ------------------*/
var delim='#@#@#',
    delimReplace='\r\n',
    nlReplace=' ',
    pos=0,
    chr;

while( !WScript.StdIn.AtEndOfStream ) {
  chr=WScript.StdIn.Read(1);
  if (chr==delim.charAt(pos)) {
    if (++pos==delim.length) {
      WScript.StdOut.Write(delimReplace);
      pos=0;
    }
  } else {
    if (pos) {
      WScript.StdOut.Write(delim.substring(0,pos));
      pos=0;
    }
    if (chr=='\n') {
      WScript.StdOut.Write(nlReplace);
    } else if (chr!='\r') {
      WScript.StdOut.Write(chr);
    }
  }
}
if (pos) WScript.StdOut.Write(delim.substring(0,pos));
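
To make the matching state explicit, here is a minimal sketch of the same one-character-at-a-time idea as a pure function in plain JavaScript (runnable in Node.js, not the JScript host used above). `transformChars` is a hypothetical name, and the input is an ordinary string rather than a stream. The only state carried between characters is `pos`, the number of delimiter characters matched so far.

```javascript
// Sketch only: character-by-character transformation as a pure function.
// transformChars, input, and delim are illustrative assumptions.
function transformChars(input, delim) {
  var out = [];
  var pos = 0;                          // delimiter characters matched so far
  for (var i = 0; i < input.length; i++) {
    var chr = input.charAt(i);
    if (chr === delim.charAt(pos)) {
      if (++pos === delim.length) {     // full delimiter seen -> emit CRLF
        out.push('\r\n');
        pos = 0;
      }
      continue;                         // matched characters are held back
    }
    if (pos) {                          // partial match failed: flush held-back text
      out.push(delim.substring(0, pos));
      pos = 0;
    }
    if (chr === '\n') out.push(' ');    // LF -> space
    else if (chr !== '\r') out.push(chr); // CR is dropped
  }
  if (pos) out.push(delim.substring(0, pos));
  return out.join('');
}

console.log(JSON.stringify(transformChars('a\r\nb#@#@#c#@d#@#@#', '#@#@#')));
// prints "a b\r\nc#@d\r\n"
```

Like the script above, this sketch does not re-test the character that breaks a partial match as the possible start of a new match, so an input such as `##@#@#` passes through unchanged. If such input can occur, the mismatching character would need to be checked against the first delimiter character again.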

I verified that all three solutions give the same result.

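Conceptually simple and memory-efficient, but slow PowerShell solution:

This PowerShell (v2+) solution is slow but conceptually simple, and you should not run out of memory, because the input is processed one line at a time, using `#@#@#` as the line delimiter.

Note: this solution combines the two steps:

It replaces each original line break with a single space.
It replaces each `#@#@#` sequence with a newline.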

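# Create a sample input file.
@'
line 1 starts here
and
ends here#@#@#line 2 is on one line#@#@#line 3 spans
two lines#@#@#
'@ > file

# Determine the input file.
$inFile = 'file'
# Create the output file.
$outFile = 'out'
$null = New-Item -Type File $outFile

# The record delimiter used in the input file.
$sep = '#@#@#'

Get-Content -Delimiter $sep $inFile | % {
  # Replace embedded CRLFs with a space and strip the trailing delimiter;
  # Add-Content appends the CRLF that ends each output line.
  Add-Content -Value ($_.Replace("`r`n", ' ').Replace($sep, '')) $outFile
}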

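Notes:

When you use `-Delimiter`, the specified delimiter is included in each item passed through the pipeline (unlike the default behavior, where the default delimiter, a newline, is stripped).
`Add-Content` automatically appends a trailing CRLF to its output (in PSv5+, this can be suppressed with `-NoNewLine`).
This approach uses the `.Replace()` method of the `[string]` type rather than PowerShell's flexible, regex-based `-replace` operator, because `.Replace()` performs a literal replacement, which is faster (the equivalent command is `Add-Content -Value (($_ -replace '\r\n', ' ') -replace '#@#@#') $outFile`). That said, the speed gain is negligible; the file I/O takes up most of the time.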

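Faster PowerShell solution with on-demand compilation of C# code:

The batch + JScript solution is much faster than the PowerShell solution above.

Below is an adaptation of that approach that uses C# code inside a PowerShell script, compiled on demand.
Compilation is surprisingly fast (about 0.3 seconds on a late-2012 iMac), and processing the file with compiled code improves performance significantly.
Also note that compilation happens only once per session, so subsequent invocations do not pay this cost.
Processing a ~1 GB file (created by repeating the contents of the sample file above) with the script printed below produces the following: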

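Compiling...
Processing file...
Completed:
  Compilation time:     00:00:00.2343647
  File-processing time: 00:00:26.0714467
  Total:                00:00:26.3278546

Execution times in real-world use will vary with many factors, but based on @dbenham's timings mentioned in the comments below, the on-demand compilation solution is roughly twice as fast as the batch + JScript solution.

Source code of the fast PowerShell solution: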

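# Determine the input and output files.
$inFile = 'file'
$outFile = 'out'

# Get the current timestamp for measuring duration.
$dtStart = [datetimeoffset]::UtcNow

# How many characters to read at a time.
# !! Make sure this is at least as large as the longest input line.
$kCHUNK_SIZE = 1000000

Write-Host 'Compiling...'

# Note: This statement performs on-demand compilation, but only
#       on the *first* invocation in a given session.
$tsCompilation = Measure-Command {

  Add-Type @"
  using System;
  using System.IO;

  namespace net.same2u.so
  {
    public static class Helper
    {
      // NOTE: the return type and method name are garbled in the source;
      // "void TransformFile" is an assumed reconstruction.
      public static void TransformFile(string inFile, string outFile, string sep)
      {
        char[] bufChars = new char[$kCHUNK_SIZE];
        using (var sw = new StreamWriter(outFile))
        using (var sr = new StreamReader(inFile))
        {
          int pos = 0;
          bool eof = false;
          string bufStr, rest = string.Empty;
          while (!(eof = sr.EndOfStream) || rest.Length > 0)
          {
            if (eof)
            {
              bufStr = rest;
            }
            else
            {
              int count = sr.Read(bufChars, 0, $kCHUNK_SIZE);
              bufStr = rest.Length > 0 ? rest + new string(bufChars, 0, count) : new string(bufChars, 0, count);
            }
            if (-1 == (pos = bufStr.Last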