Text: Cleaning misplaced CR+LF in text

Tags: text, replace, notepad++, data-cleaning, code-cleanup

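I have a TXT file that I want to import into Excel to study. Before importing it, though, I am struggling with the formatting of the text. It is a real mess, as you can see: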

| 1020941333    |     569|SP    |500000343 | 9|18.05.2011|15:27:00|18.05.2011|
18.05.2011|Y-0444871-ENCR    |           1,93 |BRL  |8000800000  |
Juros, Comissões e T       |                  |           |
                                        |    |          |     |
                     |CLB082902  |     |     |                 |COEL  |COEL  |
Y-0444871               |
| 1020941586    |      43|SP    |500000344 |43|18.05.2011|15:41:43|18.05.2011|
18.05.2011|B-0447039-ENCR    |           9,02 |BRL  |8000800000  |
Juros, Comissões e T       |                  |           |
                                        |    |          |     |
                     |CLB082902  |     |     |                 |COEL  |COEL  |
B-0447039               |
| 1021245920    |     956|SP    |500000489 | 6|14.06.2011|15:24:02|14.06.2011|
14.06.2011|B-0447039-ENCR    |           8,95 |BRL  |8000800000  |
Juros, Comissões e T       |                  |           |
                                        |    |          |     |
                     |CLB082902  |     |     |                 |COEL  |COEL  |
B-0447039               |
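
So I went looking for the reason the text is so odd. I found that it is because some CR+LF (carriage return + line feed) pairs are in the wrong place. I made some corrections by hand, and with those corrections I can see that the text can be organized much better, like this: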

--------------------------------------------------------------------------------
| Nº documento  |     LL.|TpDoc.|Nº doc.ref|LL|Entrado em|Hora    |Data doc. |Dt.lçto.  |Elemento PEP      | Valor/moeda ACC|MdACC|Cl.custo    |Denom.classe custo         |Material          |  Qtd.entr.|Texto breve material                    |UML |Doc.compra| Item|Texto do pedido      |Usuário    |DEs  |Est  |Nº ref.estorno   |Empr. |EmFI  |Definição do projeto
--------------------------------------------------------------------------------
| 1016939462    |       1|WE    |5000058364| 1|22.02.2010|10:52:43|22.02.2010|22.02.2010|Y0444871PROJELMC  |         540,93 |BRL  |8000124000  |Serviço de Terceiro        |                  |     1,000 |                                        |UR  |4501328844|    1|ESTUDOS E PROJ. REDE |CLB055760  |     |     |                 |COEL  |COEL  |Y-0444871               |
| 1020016002    |       1|WE    |5000053667| 1|15.02.2011|11:56:05|15.02.2011|15.02.2011|B0447039PROJELMC  |       2.011,84 |BRL  |8000124000  |Serviço de Terceiro        |                  |     1,000 |                                        |UR  |4501633481|    1|ESTUDOS E PROJ. REDE |CLB093440  |     |     |                 |COEL  |COEL  |B-0447039               |
| 1020258918    |     798|SP    |500000121 | 8|15.03.2011|18:06:18|15.03.2011|15.03.2011|B-0447039-ENCR    |           6,92 |BRL  |8000800000  |Juros, Comissões e T       |                  |           |                                        |    |          |     |                     |CLB107395  |     |     |                 |COEL  |COEL  |B-0447039               |
| 1020585116    |     761|SP    |500000225 | 1|15.04.2011|14:13:44|15.04.2011|15.04.2011|Y-0444871-ENCR    |           1,88 |BRL  |8000800000  |Juros, Comissões e T       |                  |           |                                        |    |          |     |                     |CLB145327  |     |     |                 |COEL  |COEL  |Y-0444871               |
| 1020586939    |     184|SP    |500000230 | 4|15.04.2011|16:22:41|15.04.2011|15.04.2011|B-0447039-ENCR    |           7,03 |BRL  |8000800000  |Juros, Comissões e T       |                  |           |                                        |    |          |     |                     |CLB145327  |     |     |                 |COEL  |COEL  |B-0447039               |
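
I can also see a pattern in the text. Every row starts with the character '|'. So every line that does not start with '|' should be joined to the previous line.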

The problem:

| 1020941333    |     569|SP    |500000343 | 9|18.05.2011|15:27:00|18.05.2011|
18.05.2011|Y-0444871-ENCR    |           1,93 |BRL  |8000800000  |
Juros, Comissões e T       |                  |           |
                                        |    |          |     |
                     |CLB082902  |     |     |                 |COEL  |COEL  |
Y-0444871               |
| 1020941586    |      43|SP    |500000344 |43|18.05.2011|15:41:43|18.05.2011|
18.05.2011|B-0447039-ENCR    |           9,02 |BRL  |8000800000  |
Juros, Comissões e T       |                  |           |
                                        |    |          |     |
                     |CLB082902  |     |     |                 |COEL  |COEL  |
B-0447039               |

Expected output:

| 1020941333    |     569|SP    |500000343 | 9|18.05.2011|15:27:00|18.05.2011|18.05.2011|Y-0444871-ENCR    |           1,93 |BRL  |8000800000  |Juros, Comissões e T       |                  |           |                                        |    |          |     |                     |CLB082902  |     |     |                 |COEL  |COEL  |Y-0444871               |
| 1020941586    |      43|SP    |500000344 |43|18.05.2011|15:41:43|18.05.2011|18.05.2011|B-0447039-ENCR    |           9,02 |BRL  |8000800000  |Juros, Comissões e T       |                  |           |                                        |    |          |     |                     |CLB082902  |     |     |                 |COEL  |COEL  |B-0447039               |

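I am having a lot of trouble doing this in Notepad++. I cannot do it by hand, because the file has more than 4.9 million lines. If anyone can give me some pointers on this problem, using Notepad++ or any other better-suited software, I would really appreciate it.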

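You can use a regex that finds a pipe followed by a line break, and use a negative lookahead (?! to check that what comes after the line break is not the pattern that starts a new row. Then replace with the first capture group to keep the pipe.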

Find what:

(\|)\R(?!\|[ \t]+\d+[ \t]+\|)

Replace with:

$1

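Explanation

  • (\|)
    Match a pipe in a capture group
  • \R
    Match a Unicode newline sequence
  • (?!
    Negative lookahead
    • Match a pipe, 1+ spaces or tabs, 1+ digits, 1+ spaces or tabs, and a pipe
  • )
    Close the negative lookahead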
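
For readers who would rather script this than use the Replace dialog, the same substitution can be sketched in Python. This is only an illustration and not part of the answer above: Python's re module has no \R, so the line break is spelled out by hand, and the name join_broken_rows and the shortened sample rows are made up for the example.

```python
import re

# Python's re module does not support \R, so list the line-break forms.
# The (?!\n) guard stops the engine from matching only the \r of a \r\n pair.
LINE_BREAK = r"(?:\r\n|\n|\r(?!\n))"

# A pipe, then a line break, then anything that does NOT look like the start
# of a new row (pipe, blanks, digits, blanks, pipe).
BROKEN_ROW = re.compile(r"(\|)" + LINE_BREAK + r"(?!\|[ \t]+\d+[ \t]+\|)")


def join_broken_rows(text: str) -> str:
    """Drop misplaced line breaks, keeping the pipe that precedes them."""
    return BROKEN_ROW.sub(r"\1", text)


if __name__ == "__main__":
    # Shortened stand-in for the rows in the question (hypothetical data).
    sample = (
        "| 1020941333 | 569|SP |\r\n"
        "18.05.2011|Y-0444871-ENCR |\r\n"
        "Y-0444871 |\r\n"
        "| 1020941586 | 43|SP |\r\n"
        "18.05.2011|B-0447039-ENCR |\r\n"
    )
    for row in join_broken_rows(sample).splitlines():
        print(row)
```

Note that, like the Notepad++ version, this also removes the line break after the very last row, because nothing that looks like a new row follows it.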

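This replaces every line break that is not followed by a pipe with nothing:

  • Ctrl+H
  • Find what:
    \R(?!\|)
  • Replace with:
    LEAVE EMPTY
  • check Wrap around
  • check Regular expression
  • Replace all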

Explanation:

\R          # any kind of line break (i.e. \r, \n, \r\n)
(?!         # negative lookahead, a zero-length assertion that makes sure the next character is not:
    \|      # a pipe character
)           # end lookahead

Result for the given example:

| 1020941333    |     569|SP    |500000343 | 9|18.05.2011|15:27:00|18.05.2011|18.05.2011|Y-0444871-ENCR    |           1,93 |BRL  |8000800000  |Juros, Comissões e T       |                  |           |                                        |    |          |     |                     |CLB082902  |     |     |                 |COEL  |COEL  |Y-0444871               |
| 1020941586    |      43|SP    |500000344 |43|18.05.2011|15:41:43|18.05.2011|18.05.2011|B-0447039-ENCR    |           9,02 |BRL  |8000800000  |Juros, Comissões e T       |                  |           |                                        |    |          |     |                     |CLB082902  |     |     |                 |COEL  |COEL  |B-0447039               |
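
With roughly 4.9 million lines, a small streaming script may be more comfortable than an editor. The sketch below applies the same rule (a line that does not start with a pipe belongs to the previous row) one line at a time, so the file never has to fit in memory. It is only an illustration: the function names, the file paths, and the utf-8 default encoding are assumptions, and the real export may well use a different encoding.

```python
from typing import Iterable, Iterator, Optional


def join_continuations(lines: Iterable[str]) -> Iterator[str]:
    """Yield one logical row for each line that starts with '|'."""
    current: Optional[str] = None
    for raw in lines:
        piece = raw.rstrip("\r\n")  # drop the (possibly misplaced) line break
        if piece.startswith("|"):
            # A new row begins: emit the row collected so far.
            if current is not None:
                yield current
            current = piece
        elif current is None:
            current = piece  # text that appears before the first '|' row
        else:
            current += piece  # continuation (or blank line): glue it on
    if current is not None:
        yield current


def clean_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Rewrite src into dst with one row per line (paths are hypothetical)."""
    # newline="" keeps Python from translating \r, \n, or \r\n on its own.
    with open(src, encoding=encoding, newline="") as fin, \
         open(dst, "w", encoding=encoding, newline="") as fout:
        for row in join_continuations(fin):
            fout.write(row + "\r\n")


if __name__ == "__main__":
    demo = ["| 1 | 569|SP |\r\n", "18.05.2011|\r\n", "| 2 | 43|SP |\r\n", "x |\r\n"]
    for row in join_continuations(demo):
        print(row)
```

Unlike the regex replacement, this version keeps a line break after the last row, which some importers prefer.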

Thank you for your help. It worked very well. Regex is something I have been trying hard to learn and master, but I just can't get it.
@ArnoldSouza: You're welcome, glad it helped. The following website is really interesting:
@ArnoldSouza: May I know why you unaccepted this answer? What is the problem?