Cleaning misplaced CR+LF in text
I have a TXT file that I want to import into Excel to study. Before importing it, though, I am struggling with the formatting of the text. It is a real mess, as you can see:
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|
18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|
18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
B-0447039 |
| 1021245920 | 956|SP |500000489 | 6|14.06.2011|15:24:02|14.06.2011|
14.06.2011|B-0447039-ENCR | 8,95 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
B-0447039
|
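So I went looking for the reason the text is so odd. I found that it is because some CR+LF (carriage return + line feed) pairs are in the wrong place. I made a few corrections by hand, and with them I can see that the text can be organized much better, like this: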
--------------------------------------------------------------------------------
| Nº documento | LL.|TpDoc.|Nº doc.ref|LL|Entrado em|Hora |Data doc. |Dt.lçto. |Elemento PEP | Valor/moeda ACC|MdACC|Cl.custo |Denom.classe custo |Material | Qtd.entr.|Texto breve material |UML |Doc.compra| Item|Texto do pedido |Usuário |DEs |Est |Nº ref.estorno |Empr. |EmFI |Definição do projeto
--------------------------------------------------------------------------------
| 1016939462 | 1|WE |5000058364| 1|22.02.2010|10:52:43|22.02.2010|22.02.2010|Y0444871PROJELMC | 540,93 |BRL |8000124000 |Serviço de Terceiro | | 1,000 | |UR |4501328844| 1|ESTUDOS E PROJ. REDE |CLB055760 | | | |COEL |COEL |Y-0444871 |
| 1020016002 | 1|WE |5000053667| 1|15.02.2011|11:56:05|15.02.2011|15.02.2011|B0447039PROJELMC | 2.011,84 |BRL |8000124000 |Serviço de Terceiro | | 1,000 | |UR |4501633481| 1|ESTUDOS E PROJ. REDE |CLB093440 | | | |COEL |COEL |B-0447039 |
| 1020258918 | 798|SP |500000121 | 8|15.03.2011|18:06:18|15.03.2011|15.03.2011|B-0447039-ENCR | 6,92 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB107395 | | | |COEL |COEL |B-0447039 |
| 1020585116 | 761|SP |500000225 | 1|15.04.2011|14:13:44|15.04.2011|15.04.2011|Y-0444871-ENCR | 1,88 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB145327 | | | |COEL |COEL |Y-0444871 |
| 1020586939 | 184|SP |500000230 | 4|15.04.2011|16:22:41|15.04.2011|15.04.2011|B-0447039-ENCR | 7,03 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB145327 | | | |COEL |COEL |B-0447039 |
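I can also see a pattern in the text. Every record starts with the | character. So every line that does not start with | should be joined to the previous line.
The problem is: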
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|
18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|
18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |
Juros, Comissões e T | | |
| | | |
|CLB082902 | | | |COEL |COEL |
B-0447039 |
Expected output:
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |B-0447039 |
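I am having a lot of trouble doing this in Notepad++. I cannot do it by hand because the file has more than 4.9 million lines. If anyone can give me some pointers on this problem, using Notepad++ or any better software, I would really appreciate it.

You can use a regex to find a pipe followed by a line break, with a negative lookahead (?! to check that what follows the line break is not the pattern that starts a new row. Then replace with the first capture group to keep the pipe.
Find what: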
(\|)\R(?!\|[ \t]+\d+[ \t]+\|)
Replace with:
$1
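Explanation
(\|) Match a pipe in a capture group
\R Match a Unicode newline sequence
(?! Negative lookahead
\|[ \t]+\d+[ \t]+\| Match a pipe, 1+ spaces or tabs, 1+ digits, 1+ spaces or tabs, and a pipe
) Close the lookahead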
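For anyone who would rather script this than run it in Notepad++, the same find-and-replace can be sketched in Python. This is only an illustration under a few assumptions: Python's `re` module has no `\R`, so an explicit alternation of `\r\n`, `\n`, and a lone `\r` stands in for it, and the names `LINEBREAK`, `MISPLACED_BREAK`, and `join_wrapped_rows` are made up for this example. The sample text is a shortened version of the rows in the question.

```python
import re

# Python's `re` has no \R, so this alternation stands in for it.
# The lone \r branch refuses to match the first half of a \r\n pair.
LINEBREAK = r"(?:\r\n|\n|\r(?!\n))"

# A pipe, then a line break that is NOT followed by the start of a new
# record (pipe, spaces/tabs, digits, spaces/tabs, pipe).
MISPLACED_BREAK = re.compile(
    r"(\|)" + LINEBREAK + r"(?!\|[ \t]+\d+[ \t]+\|)"
)


def join_wrapped_rows(text: str) -> str:
    """Drop misplaced line breaks while keeping the pipe before them."""
    return MISPLACED_BREAK.sub(r"\1", text)


sample = (
    "| 1020941333 | 569|SP |\n"
    "18.05.2011|Y-0444871-ENCR |\n"
    "Y-0444871 |\n"
    "| 1020941586 | 43|SP |\n"
    "18.05.2011|B-0447039 |\n"
)

for row in join_wrapped_rows(sample).splitlines():
    print(row)
```

Note that the line break at the very end of the text is removed as well, because nothing follows it and the lookahead therefore succeeds. Calling `sub` on one string means the whole file has to fit in memory, which may or may not be acceptable for a file of several million lines.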
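This replaces any kind of line break that is not followed by a pipe with nothing:
- Ctrl+H
- Find what: \R(?!\|)
- Replace with: LEAVE EMPTY
- check Wrap around
- check Regular expression
- Replace all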
Explanation:
\R      # any kind of line break (i.e. \r, \n, \r\n)
(?!     # negative lookahead, a zero-length assertion that makes sure what follows is not:
  \|    # a pipe character
)       # end lookahead
Result for the given example:
| 1020941333 | 569|SP |500000343 | 9|18.05.2011|15:27:00|18.05.2011|18.05.2011|Y-0444871-ENCR | 1,93 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |Y-0444871 |
| 1020941586 | 43|SP |500000344 |43|18.05.2011|15:41:43|18.05.2011|18.05.2011|B-0447039-ENCR | 9,02 |BRL |8000800000 |Juros, Comissões e T | | | | | | | |CLB082902 | | | |COEL |COEL |B-0447039 |
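Since the question also asks about other software, here is a rough Python sketch of the same rule (remove every line break that is not followed by a pipe) that reads the file line by line instead of loading all of it at once. It is not part of the Notepad++ approach; the function names, the `encoding` default, and the choice to write CR+LF endings are assumptions made for illustration. It also assumes, as the rule does, that continuation lines never have a pipe as their very first character.

```python
from typing import Iterable, Iterator


def join_continuations(lines: Iterable[str]) -> Iterator[str]:
    """Yield complete records: a line that does not start with '|'
    is appended to the record that precedes it."""
    record = None
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the line ending, whatever its kind
        if line.startswith("|") or record is None:
            if record is not None:
                yield record
            record = line
        else:
            record += line
    if record is not None:
        yield record


def clean_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Stream src to dst, one joined record per output line.
    The encoding is a guess; adjust it to match the real export."""
    with open(src, encoding=encoding, newline="") as fin, \
         open(dst, "w", encoding=encoding, newline="") as fout:
        for record in join_continuations(fin):
            fout.write(record + "\r\n")


sample = [
    "| 1020941333 | 569|SP |\r\n",
    "18.05.2011|Y-0444871-ENCR |\r\n",
    "   |COEL |\r\n",
    "Y-0444871 |\r\n",
    "| 1020941586 | 43|SP |\r\n",
    "18.05.2011|B-0447039 |\r\n",
]

for record in join_continuations(sample):
    print(record)
```

Only one record is held in memory at a time, so the size of the file should matter much less than with a single whole-file replacement.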
Thanks for the help. It worked great for me. Regex is something I have been trying hard to learn and master, but I really could not manage this one.
@ArnoldSouza: You're welcome, glad it helped. The following site is really interesting:
@ArnoldSouza: May I know why you did not accept this answer? Where is the problem?