Multiline data: Using PowerShell to remove LF (but not CRLF) from a CSV
I have some CSV data that I need to clean up by removing inline line feeds and special characters such as typographic quotes. I feel I could do this with Python or Unix utilities, but I'm stuck on a pretty vanilla Windows 2012 box, so I'm giving PowerShell v5 a try despite my lack of experience with it. Here is what I'm hoping to achieve:
$InputFile:
"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020306","John","Davis","Employee was not dressed appropriately."{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, “That’s not my job”"{CRLF}
"00020309","Bob","Meyers","Employee did the following:{LF}
• Showed up late{LF}
• Did not complete assignments{LF}
• Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}
$OutputFile:
"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, ""That's not my job"""{CRLF}
"00020309","Bob","Meyers","Employee did the following: * Showed up late * Did not complete assignments * Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}
The following code works:
(Get-Content $InputFile -Raw) `
-replace '(?<!\x0d)\x0a',' ' `
-replace "[‘’´]","'" `
-replace '[“”]','""' `
-replace "\xa0"," " `
-replace '[•·]','*' | Set-Content $OutputFile -Encoding ASCII
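But when I try to update a field on each object instead (see the failing line in the error below), my objects don't appear to have a notes property: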
Exception setting "notes": "The property 'notes' cannot be found on this object. Verify that the property exists and can be set."
At C:\convert.ps1:53 char:5
+ $_.notes= $_.notes -replace '(?<!\x0d)\x0a',' '
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], SetValueInvocationException
+ FullyQualifiedErrorId : ExceptionWhenSetting
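For reference, the negative lookbehind used in the -replace chain above can be checked on a small in-memory string. This is a minimal sketch; the sample record and the variable names ($sample, $cleaned, $crlfCount, $bareLfCount) are made up for illustration and do not come from the original data.

```powershell
# Hypothetical sample: two records terminated by CRLF; the first one
# contains two bare LFs inside its last quoted field.
$sample = "`"1`",`"a:`n- b`n- c`"`r`n`"2`",`"d`"`r`n"

# (?<!\x0d)\x0a matches an LF only when the preceding character is not a CR,
# so CRLF pairs are left alone and bare LFs become a single space.
$cleaned = $sample -replace '(?<!\x0d)\x0a', ' '

# Count what is left: both CRLF pairs should survive, no bare LF should remain.
$crlfCount   = ([regex]::Matches($cleaned, '\x0d\x0a')).Count
$bareLfCount = ([regex]::Matches($cleaned, '(?<!\x0d)\x0a')).Count
```

Because the lookbehind is zero-width, the character before the LF is never consumed, which is why the CR of each CRLF pair stays in place.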
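Note:
- For a robust solution, see my other answer.
- The answer below may still be of interest as a well-performing, general line-by-line processing solution, although it invariably treats LF-only instances as line separators too. (It has been updated to use the same regex for distinguishing the start of a row from a continuation line as the AutoIt solution added to the question.)
Given the size of your file, I suggest sticking with plain-text processing for performance reasons:
- The line-reading statement enables fast line-by-line processing; it recognizes both CRLF and LF as newlines, as PowerShell generally does. Note, however, that because each line is returned with its trailing newline stripped, you cannot tell whether an input line ended in CRLF or only LF.
- Using .NET types directly bypasses the pipeline and enables fast writing to the output file.
- For general PowerShell performance tips, see the resource linked from the original answer.
The solution relies on the regex ^"[^",] to identify the start of a row; hopefully it is robust enough (you have deemed it to be, given that your AutoIt solution is based on it).
This simple distinction between the start of a row and the continuation of a previous row avoids the lower-level file I/O needed to distinguish CRLF from LF newlines, which is what my other answer does.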
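The approach described above can be sketched roughly as follows. This is not the answer's original code: the function name Join-CsvContinuationLines and its parameters are hypothetical, and the sketch assumes [System.IO.File]::ReadLines for reading and [System.IO.StreamWriter] for writing. The special characters are written as \u escapes so that the script does not depend on its own file encoding.

```powershell
# Sketch only: hypothetical function name and parameters.
function Join-CsvContinuationLines {
    param(
        [Parameter(Mandatory)] [string] $InputPath,   # full path of the source CSV
        [Parameter(Mandatory)] [string] $OutputPath   # full path of the cleaned CSV
    )

    # .NET resolves relative paths against the process directory rather than
    # PowerShell's current location, so full paths are assumed here.
    $writer = [System.IO.StreamWriter]::new($OutputPath, $false, [System.Text.Encoding]::ASCII)
    try {
        $first = $true
        foreach ($line in [System.IO.File]::ReadLines($InputPath)) {
            # Same character clean-up as in the question, using \u escapes.
            $clean = $line -replace '[\u2018\u2019\u00b4]', "'" `
                           -replace '[\u201c\u201d]', '""' `
                           -replace '\u00a0', ' ' `
                           -replace '[\u2022\u00b7]', '*'

            if ($line -match '^"[^",]') {
                # Start of a new row: close the previous row (if any) with CRLF.
                if (-not $first) { $writer.Write("`r`n") }
            }
            elseif (-not $first) {
                # Continuation of the previous row: join with a single space.
                $writer.Write(' ')
            }
            $writer.Write($clean)
            $first = $false
        }
        # Terminate the last row.
        if (-not $first) { $writer.Write("`r`n") }
    }
    finally {
        $writer.Dispose()
    }
}
```

A call might look like Join-CsvContinuationLines -InputPath C:\data\in.csv -OutputPath C:\data\out.csv, where both paths are placeholders. Because rows are recognized only by the ^"[^",] pattern, a continuation line that itself starts with a double quote followed by an ordinary character would be mistaken for a new row.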
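The first answer is probably better than this, because I'm not sure whether PowerShell has to load everything into memory when done this way (though I think it does), but starting from what you have above, this is what I came up with: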
# Import CSV into a variable (this loads the whole file into memory)
$InputFile = Import-Csv $InputFilePath
# Get all field names and store them in $Fields
$Fields = ($InputFile | Get-Member -MemberType NoteProperty).Name
# Update each field of each row, then write all rows in a single Export-Csv call
$InputFile | ForEach-Object {
    $thisLine = $_
    foreach ($field in $Fields) {
        # Export-Csv doubles embedded double quotes itself,
        # so typographic quotes are mapped to a single " here
        $thisLine.$field = $thisLine.$field `
            -replace '(?<!\x0d)\x0a',' ' `
            -replace "[‘’´]","'" `
            -replace '[“”]','"' `
            -replace "\xa0"," " `
            -replace '[•·]','*'
    }
    $thisLine
} | Export-Csv $OutputFile -NoTypeInformation -Encoding ASCII
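Here is another "line-by-line" attempt, somewhat similar to mklement0's answer. It assumes that no "row continuation" line begins with ". Hopefully it performs better.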
# Clear contents of file (Not sure if you need/want this...)
if (Test-Path -type leaf $OutputFile) { Clear-Content $OutputFile }
# Flag for first entry, since no data manipulation needed there
$firstEntry = $true
foreach($line in [System.IO.File]::ReadLines($InputFile)) {
    if ($firstEntry) {
        Add-Content -Path $OutputFile -Value $line -NoNewline
        $firstEntry = $false
    }
    else {
        if ($line[0] -eq '"') { Add-Content -Path $OutputFile "`r`n" -NoNewline}
        else { Add-Content -Path $OutputFile " " -NoNewline}
        $sanitizedLine = $line -replace '(?<!\x0d)\x0a',' ' `
                               -replace "[‘’´]","'" `
                               -replace '[“”]','""' `
                               -replace "\xa0"," " `
                               -replace '[•·]','*'
        Add-Content -Path $OutputFile -Value $sanitizedLine -NoNewline
    }
}