Python: match a file name and replace it with a new name

python bash shell python-2.7 notepad++

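I have two fairly large .txt files with similar ID tags. What I need to do is take an ID tag from one file, match it in the other file, and replace the ID with the name from the first file. I need to do this for 1000+ tags. The key is to match part of the ID tag name from the first file exactly and replace it.

There is one unique ID tag per line, and there is always an exact match between the two files (for positions [6-16] = "10737.G1C22"). The matches are scattered, so line 1 of File1.txt might match line 504 of File2.txt.
The line order in both files cannot be sorted and must be preserved.

For example: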

File1.txt = 
TYPE1_10737.G1C22 ---------
...

File2.txt = 
10737.G1C22 ----------

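I need the name from File1.txt, specifically "10737.G1C22", to find its exact match in File2.txt and replace it with "TYPE1_10737.G1C22".

The edit would then look like this, with the name in File2.txt changed according to the match in File1.txt: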

 File2.txt = 
 TYPE1_10737.G1C22 ---------
 ...

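I tried some sed functions but ran into problems. It is important that once an exact match is found, only the first 6 characters of the name change (the "TYPE1_" prefix is added) and nothing else. There are more than 1000 ID tags to match and change.

The code I have in mind would match positions [6-16] exactly and replace them with positions [0-16] from File1.txt.

Any help is much appreciated. Is this possible? I'm open to other suggestions too. Thanks, everyone.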

A Bash and `ed` solution

Step 1. Create `File1.txt` and `File2.txt`, two files more or less like yours (1000 lines), to experiment with and have some fun. Use this script (in a scratch directory):
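The generator script is not shown above. As a stand-in, here is a minimal Python sketch that produces two comparable test files. The function name `make_test_files`, the `TYPE1_` prefix, the id shape, and the dash filler are all assumptions modelled on the example in the question, not the answerer's original script.

```python
import random

def make_test_files(n=1000, seed=0, file1="File1.txt", file2="File2.txt"):
    """Write two sample files whose lines share ids but appear in different orders."""
    rng = random.Random(seed)
    # Ids shaped like the question's "10737.G1C22" (assumed layout).
    ids = ["{:05d}.G1C22".format(10000 + i) for i in range(n)]
    lines1 = ["TYPE1_{} ---------".format(i) for i in ids]
    lines2 = ["{} ----------".format(i) for i in ids]
    # Shuffle independently so matching lines end up at unrelated positions.
    rng.shuffle(lines1)
    rng.shuffle(lines2)
    with open(file1, "w") as f:
        f.write("\n".join(lines1) + "\n")
    with open(file2, "w") as f:
        f.write("\n".join(lines2) + "\n")
    return ids
```

Calling `make_test_files()` in a scratch directory writes both files there, so the real data is never touched while experimenting.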

Step 2. Use the standard editor `ed` to perform the replacements, wrapped in the following script:

#!/bin/bash

ed -s File2.txt < <(
   while read l _; do
      p=${l:6}
      p=${p//./\\.}
      echo "%s/^$p/$l/"
   done < File1.txt
   echo wq
)

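Done.

Note: since `ed` is a true editor, the replacements are made in place; `File2.txt` itself is edited.

Hey, wait, I may have overlooked your 16-character requirement... I relied on the fact that there is a space after your pattern. If my solution falls short in this respect, please let me know and I'll modify it accordingly.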
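To make the reliance on the trailing space explicit, here is a small Python sketch that builds the same kind of `ed` substitution commands as the shell loop, but includes the space in both the pattern and the replacement. The helper name `ed_commands` is hypothetical, and the sketch assumes a fixed 6-character prefix and that the dot is the only regex-special character in the ids.

```python
def ed_commands(file1_lines):
    """Yield ed substitution commands, one per non-blank File1 line, then 'wq'."""
    for line in file1_lines:
        if not line.strip():
            continue
        name = line.split()[0]                 # e.g. "TYPE1_10737.G1C22"
        ident = name[6:]                       # drop the assumed 6-character prefix
        pattern = ident.replace(".", r"\.")    # a bare dot matches any character in ed
        # The trailing space keeps "10737.G1C22" from matching "10737.G1C22X".
        yield "%s/^{} /{} /".format(pattern, name)
    yield "wq"

commands = list(ed_commands(["TYPE1_10737.G1C22 ---------"]))
```

The resulting list could be joined with newlines and piped into `ed -s File2.txt`, just as the process substitution does in the script above.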

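A Python-based solution is straightforward, but note that it cannot be done in place: you have to store the result in some new location.

If your files are not too large, i.e. you can build the mapping in memory, then building it is straightforward, assuming that (1) the name and the id are separated by an underscore, (2) the id and the text are separated by a space, as in the example, (3) every line contains both an id and a name, and (4) each id has only one name in file 1: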
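The answer's own mapping code is not shown here. A minimal sketch under the four assumptions just listed might look like the following; the function name `build_mapping` and the sample lines are illustrative only.

```python
def build_mapping(lines):
    """Map each id to its full name, e.g. '10737.G1C22' -> 'TYPE1_10737.G1C22'."""
    mapping = {}
    for line in lines:
        if not line.strip():
            continue                        # tolerate blank lines
        name = line.split()[0]              # assumption 2: whitespace ends the name
        _, ident = name.split("_", 1)       # assumption 1: underscore separates prefix and id
        mapping[ident] = name               # assumption 4: one name per id
    return mapping

sample = ["TYPE1_10737.G1C22 ---------\n", "TYPE1_10738.G1C22 ---------\n"]
mapping = build_mapping(sample)
```

With a real file, the same function can be fed the open file object directly, e.g. `with open("File1.txt") as f: mapping = build_mapping(f)`.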

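Once you have the mapping, the replacement is easy: a line is left unchanged if no match is found. You then just need to write the resulting generator out to a file.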
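The replacement and write-out steps could be sketched roughly as below. The names `replace_ids` and `write_replaced` are hypothetical, and the small `mapping` dict stands in for one built from File1.txt; the sketch assumes the id is the first space-delimited token on each line.

```python
def replace_ids(lines, mapping):
    """Yield each line with its leading id swapped for the mapped name, if any."""
    for line in lines:
        ident, sep, rest = line.partition(" ")
        # dict.get falls back to the original id, so unmatched lines pass through unchanged.
        yield mapping.get(ident, ident) + sep + rest

def write_replaced(src_path, dst_path, mapping):
    """Stream src_path through replace_ids into a new file at dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(replace_ids(src, mapping))

mapping = {"10737.G1C22": "TYPE1_10737.G1C22"}   # stand-in for a mapping built from File1.txt
demo = list(replace_ids(["10737.G1C22 ----------\n", "10788.G1C22 ----------\n"], mapping))
```

Because `replace_ids` is a generator, `write_replaced` never holds the second file in memory; only the mapping has to fit.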

A simple Python solution:

from collections import OrderedDict

LINES_PER_CYCLE = 1000

with open('output.txt', 'w') as output, open('test_2.txt', 'r') as fin:
    fin_line = ''

    # Loop until fin reaches EOF.
    while True:
        cache = OrderedDict()

        # Fill the cache with up to LINES_PER_CYCLE entries.
        for _ in range(LINES_PER_CYCLE):
            fin_line = fin.readline()
            if not fin_line:
                break

            key, rest = fin_line.strip().split(' ', 1)
            # Default to the bare id so unmatched lines are written unchanged.
            cache[key] = [key, rest]

        # Scan test_1.txt for tags whose id is in the current cache.
        with open('test_1.txt', 'r') as tags:
            for line in tags:
                tag, _ = line.split(' ', 1)
                _, idx = tag.rsplit('_', 1)
                if idx in cache:
                    cache[idx][0] = tag

        # Write the lines to the output file, in the same order
        # as they were inserted into the cache.
        for tag, rest in cache.values():
            output.write('{} {}\n'.format(tag, rest))

        # If fin has reached EOF, break.
        if not fin_line:
            break

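What it does is read `LINES_PER_CYCLE` entries from the second file (`test_2.txt` in the code), look for the matching entries in the first file (`test_1.txt` in the code), and write the result to the output. Because memory (for the cache) is limited, the first file is scanned multiple times, once per cycle.

This assumes that the tag/id part is separated from the `----` part by a space, and that the tag and the id are separated by an underscore, i.e. "tag_idx blah blah blah".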

I would load the first file into a dict, then process the second file to match the keys, and output any changes to a third file:

import re

# Pattern to match in File1: captures the type prefix and the id.
pattern1 = r"(\w+)_(\d+\.\w+)\s+.*$"

# Pattern to match in File2: captures the id.
pattern2 = r"(\d+\.\w+)\s+.*$"

# Load the 'master' file into a dict,
# with the id as key and the type prefix as value.
file1_dict = dict()
with open("File1.txt", "r") as f:
    for line in f:
        m = re.match(pattern1, line)
        if m:
            file1_dict[m.group(2)] = m.group(1)

# Open a new output file to replace File2.txt
with open("File3.txt", "w") as fnew:
    # As you process each line in File2.txt,
    # look up its id in the File1 dict.
    # Write either the new, prefixed line or
    # the old unmatched line to File3.txt.
    with open("File2.txt", "r") as f:
        for line in f:
            m = re.match(pattern2, line)
            if m and m.group(1) in file1_dict:
                fnew.write("{0}_{1}".format(file1_dict[m.group(1)], line))
            else:
                fnew.write(line)

# Then just overwrite File2.txt with new File3.txt contents.

# Original File1.txt
TYPE1_10737.G1C22 ---------
TYPE1_10738.G1C22 ---------
TYPE1_10739.G1C22 ---------
TYPE1_10740.G1C22 ---------
TYPE1_10741.G1C22 ---------
TYPE1_10742.G1C22 ---------
TYPE1_10799.G1C22 ---------

# Original File2.txt
10737.G1C22 ---------
10738.G1C22 ---------
10739.G1C22 ---------
10740.G1C22 ---------
10788.G1C22 ---------
10741.G1C22 ---------
10742.G1C22 ---------

# Results of new File3.txt
TYPE1_10737.G1C22 ---------
TYPE1_10738.G1C22 ---------
TYPE1_10739.G1C22 ---------
TYPE1_10740.G1C22 ---------
10788.G1C22 ---------
TYPE1_10741.G1C22 ---------
TYPE1_10742.G1C22 ---------

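One ID per line, always at the start of the line in each file? Does each file2 ID have at most one match in file1?

Yes. One ID per line, and yes, there is at least one match between the files. Thanks a lot.

One approach is to build a dictionary from file1 (I see your python tag), where the keys are the strings to find in file2 and the values are the full names. Then read file2 and look each ID up in the dictionary.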
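The dictionary idea from this comment can also be combined with the fixed character positions mentioned in the question. The sketch below is one possible reading, assuming every File1 name is exactly a 6-character prefix followed by an 11-character id, and that the id starts each File2 line; `rename_by_position`, `PREFIX_LEN`, and `ID_LEN` are hypothetical names.

```python
PREFIX_LEN = 6    # len("TYPE1_"), assumed constant for every File1 line
ID_LEN = 11       # len("10737.G1C22"), assumed constant for every id

def rename_by_position(file1_lines, file2_lines):
    """Return File2's lines with each leading id replaced by File1's full name."""
    names = {}
    for line in file1_lines:
        full = line.rstrip("\n")[:PREFIX_LEN + ID_LEN]   # positions [0-16]
        if len(full) == PREFIX_LEN + ID_LEN:
            names[full[PREFIX_LEN:]] = full              # positions [6-16] -> [0-16]
    renamed = []
    for line in file2_lines:
        ident = line[:ID_LEN]
        # Unmatched ids fall back to themselves, so the line is kept as-is.
        renamed.append(names.get(ident, ident) + line[ID_LEN:])
    return renamed

file1 = ["TYPE1_10737.G1C22 ---------\n", "TYPE1_10799.G1C22 ---------\n"]
file2 = ["10788.G1C22 ---------\n", "10737.G1C22 ---------\n"]
result = rename_by_position(file1, file2)
```

Slicing avoids any dependence on separators, but it breaks as soon as a prefix or id has a different length, so the delimiter-based approaches are safer when the layout is not strictly fixed-width.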
