Python: match ID tags between files and replace them with new names
I have two fairly large .txt files that contain similar ID tags. I need to take each ID tag from one file, find its match in the other file, and replace the ID there with the name from the first file. There are 1000+ tags to process. The key point is to match part of the ID tag name from the first file exactly and then replace it.
- Each line has a unique ID tag, and there is always an exact match between the two files (on positions [6-16] = "10737.G1C22"); the matches are scattered, so line 1 in File1.txt may match line 504 in File2.txt
- The line order in both files cannot be sorted and must be preserved
For example:
File1.txt =
TYPE1_10737.G1C22 ---------
...
File2.txt =
10737.G1C22 ----------
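I need the name from File1.txt, specifically "10737.G1C22", to find its exact match in File2.txt and be replaced there with "TYPE1_10737.G1C22".
After the edit, the names in File2.txt have been changed according to the matches in File1.txt, like this: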
File2.txt =
TYPE1_10737.G1C22 ---------
...
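I tried a few sed commands but ran into problems. The important thing is that once an exact match is found, only the first 6 characters of the name (the "TYPE1_" prefix) change, and no other characters. More than 1000 ID tags need to be matched and changed.
The code I have in mind would match positions [6-16] exactly and replace them with [0-16] from File1.txt.
Any help is much appreciated. Is this possible? I am open to other suggestions too. Thanks, everyone.
Bash and ed solution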
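- Step 1. Create File1.txt and File2.txt, two files more or less like yours (1000 lines), to experiment with and have some fun. Generate them with a script, run in a scratch directory.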
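- Step 2. Use the standard editor, ed, to perform the substitutions, wrapped in the following script:
    #!/bin/bash
    # Feed ed one substitution command per line of File1.txt, then save and quit.
    ed -s File2.txt < <(
        while read -r l _; do
            p=${l:6}         # drop the 6-character prefix, e.g. "TYPE1_"
            p=${p//./\\.}    # escape dots so they match literally in the regex
            echo "%s/^$p/$l/"
        done < File1.txt
        echo wq
    )
ed is a real editor, so the replacements are made in place: File2.txt really is edited.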
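One way to produce test files like those in Step 1 is sketched below in Python. The function names `make_test_lines` and `write_lines`, the `TYPE<digit>_` prefixes, and the ID layout (five digits, a dot, `G1C22`) are assumptions modelled on the question's example, not the answerer's script.

```python
import random


def make_test_lines(n=1000, seed=0):
    """Return (file1_lines, file2_lines): n matching IDs, each list in its own order."""
    rng = random.Random(seed)
    # Hypothetical ID layout copied from the question's example.
    ids = ["{:05d}.G1C22".format(10000 + i) for i in range(n)]
    # A 6-character prefix such as "TYPE1_" goes in front of each ID in File1.
    file1 = ["TYPE{}_{} ---------".format(rng.randint(1, 9), i) for i in ids]
    file2 = ["{} ----------".format(i) for i in ids]
    # Shuffle independently so matching lines end up at different positions.
    rng.shuffle(file1)
    rng.shuffle(file2)
    return file1, file2


def write_lines(path, lines):
    """Write the lines to path, one per line."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `write_lines("File1.txt", f1)` and `write_lines("File2.txt", f2)` on the two returned lists would give a pair of files to try the script on.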
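Hey, wait, I may have overlooked your 16-character requirement... I relied on the fact that there is a space right after your pattern. If my solution falls short on that point, please let me know and I will amend it accordingly.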
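For readers more at home in Python, the substitution idea can be sketched with the `re` module. This is an illustration, not a port of the ed script: the helper names `build_pattern` and `rename` are made up here, and the requirement that the ID be followed by whitespace or end of line is an added assumption that keeps "10737.G1C22" from matching a longer ID such as "10737.G1C225".

```python
import re


def build_pattern(new_name, prefix_len=6):
    """Compile a pattern for the bare ID at the start of a line."""
    bare_id = new_name[prefix_len:]
    # re.escape makes the dot literal, like the dot-escaping step in the shell script.
    # The lookahead requires whitespace or end of line right after the ID.
    return re.compile(r"^" + re.escape(bare_id) + r"(?=\s|$)")


def rename(line, new_name, prefix_len=6):
    """Return line with its leading bare ID replaced by new_name, if it matches."""
    pattern = build_pattern(new_name, prefix_len)
    # A function replacement avoids any backslash processing of new_name.
    return pattern.sub(lambda m: new_name, line, count=1)
```

Compiling one pattern per name is fine for a demonstration; for 1000+ names against a large file, a dictionary lookup on the leading ID is likely to be cheaper.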
A Python-based solution would be simple, but note that it cannot be done in place: you have to store the result in some new location.
If your files are not too big, i.e. you can build the mapping in memory, then constructing it is straightforward, assuming that 1) the name is separated from the id by an underscore, 2) the id is separated from the rest of the text by a space, as in the example, 3) every line contains both an id and a name, and 4) each id has only one name in file 1.
Once you have the mapping, the replacement is easy (if no match is found, the string is left unchanged). You then just need to write the resulting generator to a file.
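The code for this answer did not survive in the text above, so here is a minimal sketch of what a mapping plus generator could look like under the four stated assumptions. The function names `build_mapping` and `replace_ids` are hypothetical.

```python
def build_mapping(file1_lines):
    """Map bare id -> full name, e.g. '10737.G1C22' -> 'TYPE1_10737.G1C22'."""
    mapping = {}
    for line in file1_lines:
        if not line.strip():
            continue  # skip blank lines
        name = line.split(" ", 1)[0]      # assumption 2: the id ends at the first space
        bare_id = name.split("_", 1)[1]   # assumption 1: an underscore separates name and id
        mapping[bare_id] = name
    return mapping


def replace_ids(file2_lines, mapping):
    """Lazily yield file 2 lines with the leading id swapped for its full name."""
    for line in file2_lines:
        bare_id, sep, rest = line.partition(" ")
        # Lines whose id has no entry in the mapping come out unchanged.
        yield mapping.get(bare_id, bare_id) + sep + rest
```

Both functions accept open file objects as well as lists, so the result can be written with something like `out.writelines(replace_ids(f2, mapping))`, which preserves the line order of file 2.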
A simple Python solution:
    from collections import OrderedDict

    LINES_PER_CYCLE = 1000

    # Text mode plus range()/values() keeps this runnable on Python 2.7 and 3.
    with open('output.txt', 'w') as output, open('test_2.txt', 'r') as fin:
        fin_line = ''
        # Loop until fin reaches EOF.
        while True:
            cache = OrderedDict()
            # Fill the cache with up to LINES_PER_CYCLE entries.
            for _ in range(LINES_PER_CYCLE):
                fin_line = fin.readline()
                if not fin_line:
                    break
                key, rest = fin_line.strip().split(' ', 1)
                # Default to the bare id so unmatched lines are written unchanged.
                cache[key] = [key, rest]
            # Loop over test_1.txt to find tags with the given ids.
            with open('test_1.txt', 'r') as tags:
                for line in tags:
                    tag, _ = line.split(' ', 1)
                    _, idx = tag.rsplit('_', 1)
                    if idx in cache:
                        cache[idx][0] = tag
            # Write the lines to the output file, in the same order
            # as they were inserted into the cache.
            for tag, rest in cache.values():
                output.write('{} {}\n'.format(tag, rest))
            # If fin has reached EOF, break.
            if not fin_line:
                break
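What it does is read LINES_PER_CYCLE entries from test_2.txt, look up the matching entries in test_1.txt, and write them to the output. Because memory is limited (for the cache), test_1.txt is searched multiple times.
This assumes that the tag/id part is separated from the ---- part by a space, and that the tag and id are separated by an underscore, i.e. "tag_idx blah blah".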
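The batching step, reading a bounded number of lines per cycle, can also be expressed with `itertools.islice`. The helper below is a sketch with a made-up name (`chunks`); it is not part of the answer's code.

```python
from itertools import islice


def chunks(lines, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(lines)
    while True:
        block = list(islice(it, size))
        if not block:
            return  # the source is exhausted
        yield block
```

With a helper like this, the outer loop becomes `for block in chunks(fin, LINES_PER_CYCLE):`. The trade-off is unchanged: the tag file is scanned once per block, so a smaller block size saves memory at the cost of more passes.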
I would load the first file into a dict, then process the second file to match the keys, and output any changes to a third file:
    import re

    # Pattern to match in File1
    pattern1 = r"(\w+)_(\d+\.\w+)\s+.*$"
    # Pattern to match in File2
    pattern2 = r"(\d+\.\w+)\s+.*$"

    # Load the 'master' file into a dict,
    # with the number as key and 'type' as value.
    file1_dict = dict()
    with open("File1.txt", "r") as f:
        for line in f:
            m = re.match(pattern1, line)
            if m:
                file1_dict[m.group(2)] = m.group(1)

    # Open a new output file to replace File2.txt
    with open("File3.txt", "w") as fnew:
        # As you process each line in File2.txt,
        # find the matching entry in the File1 dict above.
        # Either write the old unmatched value or the new
        # matching, changed value to File3.txt
        with open("File2.txt", "r") as f:
            for line in f:
                m = re.match(pattern2, line)
                if m and m.group(1) in file1_dict:
                    fnew.write("{0}_{1}".format(file1_dict[m.group(1)], line))
                else:
                    fnew.write(line)

    # Then just overwrite File2.txt with new File3.txt contents.
# Original File1.txt
TYPE1_10737.G1C22 ---------
TYPE1_10738.G1C22 ---------
TYPE1_10739.G1C22 ---------
TYPE1_10740.G1C22 ---------
TYPE1_10741.G1C22 ---------
TYPE1_10742.G1C22 ---------
TYPE1_10799.G1C22 ---------
# Original File2.txt
10737.G1C22 ---------
10738.G1C22 ---------
10739.G1C22 ---------
10740.G1C22 ---------
10788.G1C22 ---------
10741.G1C22 ---------
10742.G1C22 ---------
# Results of new File3.txt
TYPE1_10737.G1C22 ---------
TYPE1_10738.G1C22 ---------
TYPE1_10739.G1C22 ---------
TYPE1_10740.G1C22 ---------
10788.G1C22 ---------
TYPE1_10741.G1C22 ---------
TYPE1_10742.G1C22 ---------
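The last step, overwriting File2.txt with the new contents, can be done with a temporary file that is swapped into place once writing has finished. The sketch below assumes Python 3.3+ for `os.replace`; the helper name `rewrite_in_place` and its `transform` callback are illustrative and not part of the answer.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Write transform(line) for every line to a temp file, then swap it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final swap stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp, open(path, "r") as src:
            for line in src:
                tmp.write(transform(line))
        # Both files are closed here; replace the original in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The `transform` argument would be a function that takes one line of File2.txt and returns the relabelled line, for instance one built around the dict lookup shown in the answer.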
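One ID per line, always at the beginning of the line in each file? And does each File2 ID have at most one match in File1?
Yes. One ID per line, and yes, there is at least one match between the files. Thanks a lot.
One approach would be to build a dictionary from File1 (I see your python tag), where each key is the match to find in File2 and the value is the full name. Then read File2 and look each ID up in the dictionary.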
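The dictionary approach from this comment can be combined with the fixed character positions from the question. The sketch below assumes every File1 name has a 6-character prefix followed by an 11-character ID (positions 6 to 16 inclusive); the constants and function names are hypothetical.

```python
ID_START, ID_END = 6, 17  # slice bounds covering positions 6..16 inclusive


def index_by_position(file1_lines):
    """Map the bare ID (positions 6..16) to the full name (positions 0..16)."""
    return {
        line[ID_START:ID_END]: line[:ID_END]
        for line in file1_lines
        if len(line) >= ID_END  # ignore lines too short to hold a full name
    }


def relabel(line, names):
    """Swap the leading bare ID for its full name; leave unmatched lines unchanged."""
    key = line[:ID_END - ID_START]
    if key in names:
        return names[key] + line[len(key):]
    return line
```

Because the key is taken by position rather than by splitting on a separator, this variant does not depend on a space following the ID, but it does break if the prefix length ever varies.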