Regex - Notepad++: search and replace is losing lines
I am very new to regex and am trying to use Notepad++ to clean up some CSV files. I am running version 7.8.2 (64-bit) because my files are too large for the 32-bit version to open.

In the data, most fields are standardized and generated automatically by the system. Every row has exactly 30 fields. However, users can type a comment into one field, and in a few cases a user entered a line break in that field. When that happens, Notepad++ starts a new line for the data.

For example, the third line below should be a continuation of the second (edited from the condensed example in the original post):

I am trying to remove the extra line break in the second line so that the data looks like this:
"39901","0002286898","88","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 2016 PER ATTACHED SCHEDULE. FOR 39901, IU journal 2297455 CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","191313.130","07/28/2016","07/01/2016","","Accrued Liabilities",""
"39901","0002290128","7","ACTUALS","To record accrued liabilities for goods or services received at June 30, 2016 per the attached schedule. Contact [NAME PHONE NUMBER EMAIL] or [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2556242.170","07/31/2016","07/01/2016","","Accrued Liabilities",""
"39901","0002291224","37","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE. FOR 34530, CONTACT: [NAME PHONE NUMBER EMAIL]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","3010262.140","07/27/2016","07/01/2016","","Accrued Liabilities",""
"39901","0002291259","2","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER THE ATTACHED SCHEDULE. FOR 34571, CONTACT: [NAME PHONE NUMBER] / [NAME PHONE NUMBER]","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","38140.260","07/27/2016","07/01/2016","","Accrued Liabilities",""
"39901","0002291336","12","ACTUALS","TO RECORD ACCRUED LIABILITIES FOR GOODS OR SERVICES RECEIVED AT JUNE 30 PER ATTACHED SCHEDULE. FOR 345.20","LA","34000000","Accrued Liabilities","","11000","","","","","","","","","","","","","2017","1","2768000.000","08/01/2016","07/01/2016","","Accrued Liabilities",""
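There are no carriage returns, only line feeds, so searching for \n also flags every line feed that legitimately ends a row.

In this data the last column is always empty (""). So I tried searching for lines that do not end that way, i.e. lines ending in a letter, digit, period, space, and so on. My plan was to replace those instances with a unique, odd word and then run a second, extended search and replace to remove the new expression together with the line break.

It is clumsy, but I have been doing it in steps:
- Find lines whose last character is a digit: \d{1}$
- Find lines whose last character is a letter: \w{1}$
- Find lines whose last character is whitespace: \s{1}$
- Find lines ending in a period: \.$

Then I would run a final search for any stragglers that do not start with 39901.

I run these searches in Regular expression mode and replace the matches with REPLACEHERE999, which I assume nobody else has typed into the data. I know this removes and replaces the last character of the line (the final digit, letter, space, etc.), but I can live with that. After these replacements I plan to run a second, extended search that replaces REPLACEHERE999\n with a single space, getting rid of both REPLACEHERE999 and the line break.

When I run the first searches, they report a plausible number of replacements given the number of errors I originally got in Power Query, e.g. 377 for \d{1}$. However, once I make those replacements, the number of lines drops dramatically. Initially I had 3,919,186 lines, but after the first search and replace (\d{1}$) I had only 1,543,818, fewer than half of what I started with. When I do the first few replacements one at a time I do not lose lines, but when I use Replace All they disappear.

Again, I am new to regex and Notepad++, so I may be missing something basic. But if I only made a limited number of replacements, why did so many of my lines disappear?

Comments and suggestions on my searches or my thinking are welcome, but the disappearing lines are the key question here.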
Thanks.

- Ctrl+H
- Find what: \R(?!")
- Replace with: LEAVE EMPTY
- check Wrap around
- check Regular expression
- Replace all

\R      # any kind of line break
(?!")   # negative lookahead: make sure the next character is not a "
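For anyone who would rather script this than run it in Notepad++, the same idea can be sketched in Python. This is an illustration only, not part of the answer above. Python's re module has no \R, so the sketch assumes the file uses only line feeds (as the question states) and matches \n directly. The function name join_broken_rows and the shortened sample rows are made up for the example.

```python
import re

# Assumption: the data contains only line feeds, so \n stands in for \R.
# (?!") is the same negative lookahead as in the Notepad++ pattern.
BREAK_NOT_BEFORE_QUOTE = re.compile(r'\n(?!")')


def join_broken_rows(text: str) -> str:
    """Remove every line feed that is not followed by a double quote."""
    return BREAK_NOT_BEFORE_QUOTE.sub("", text)


# Made-up, shortened rows; the second row is split inside its comment field.
sample = (
    '"39901","first comment","LA",""\n'
    '"39901","comment split\n'
    'across two lines","LA",""\n'
    '"39901","third comment","LA",""\n'
)

fixed = join_broken_rows(sample)
print(fixed)
```

Two side effects are worth knowing about: because the replacement is empty, the two halves of the comment are joined without a space, and the line feed at the very end of the text is removed as well, since nothing follows it.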
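Assuming every row contains exactly 30 columns and each column can contain any character except a double quote: with extended mode, regular expression search and wrap around turned on, you can do this in two steps. Search for

(("[^"]*",){29}"[^"]*")\s?

and put the following in the "Replace with:" field:

$1\n

- Each field has the form "[^"]*". In your example there are 30 fields, and the first 29 are followed by a comma.
- In my regex, the allowed characters are everything except a double quote.
- Let us write \x for [^"]. Then each field has the form "\x*", and the regex matches the pattern ("\x*",){29}"\x*" over and over. We add a new line after each segment of that form.
- \s? takes care of the leftover whitespace after each group of 30 entries.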
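The 30-field pattern can also be tried outside Notepad++. The following Python sketch is only an illustration under the same assumptions (exactly 30 quoted fields per record, no double quotes inside a field). The helper names rebuild_rows and make_row and the sample data are hypothetical. Replacing line breaks inside a record with a space is an extra step added here and is not part of the answer's replacement string.

```python
import re

# One record: 29 quoted fields each followed by a comma, then a 30th quoted
# field. [^"] also matches line breaks, so a record may span several lines.
RECORD = re.compile(r'((?:"[^"]*",){29}"[^"]*")\s?')


def rebuild_rows(text):
    """Return one string per 30-field record found in `text`."""
    rows = []
    for match in RECORD.finditer(text):
        # Extra step (an assumption, not in the answer): turn line breaks
        # inside the record into a single space.
        rows.append(re.sub(r"\s*\n\s*", " ", match.group(1)))
    return rows


def make_row(comment):
    """Build a made-up 30-field row whose fifth field is `comment`."""
    fields = ["39901", "0002286898", "88", "ACTUALS", comment] + [""] * 25
    return ",".join('"{}"'.format(field) for field in fields)


# Two records; the second one has a line break inside its comment field.
sample = make_row("ok") + "\n" + make_row("split\nin two") + "\n"
rows = rebuild_rows(sample)
print(len(rows))
print("\n".join(rows))
```

Counting fields rather than looking at line endings means the sketch does not depend on what character a broken line happens to end with, but it does rely on every record really having 30 fields.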
Note: the linked demo uses an earlier, less inclusive version of the regex.