Bash: search for strings in file2 based on file1 and replace them

bash awk sed

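I am new to shell scripting and need guidance on a typical requirement. I have two files (1. a master file and 2. a pattern file). The master file contains many fields separated by |, and only the 10th and 15th fields need to be updated according to the pattern file.

For example, this line in the master file:

123|1|2|3|...|9|nice weather in europe today|11|.....

needs to be replaced with

123|1|2|3|...|9|nice weather in EU today|11|.....

I started with a simple sed command that replaces text in the master file using values taken from the pattern file. It is incomplete, though: I do not know how to handle a huge master file, and it also replaces text outside the specific fields.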

while read line
do
    value1=$(echo $line | awk -F"|" '{print $1}')
    value2=$(echo $line | awk -F"|" '{print $2}')
    sed -i 's/ '${value1}' /'${value2}'/g' master.txt
done < pattern.txt

The script above is very slow even for a 10 MB file, and my master file is considerably larger (100 MB).
Please help.
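
Your script is probably slow because of the number of child processes it creates: two awk calls and one sed call for every line of the pattern file. It also reads the larger file (master.txt) many more times than the smaller one, once per pattern.
Note that sed's -i option is non-standard.
By using bash, you can get rid of the calls to the awk interpreter and the sed editor: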
# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns

while IFS='|' read -r key value
do
    patterns[$key]="$value"
done < pattern.txt

# Set the option for case insensitive patterns
shopt -s nocasematch

# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line
do
    # Iterate through the patterns array
    for key in "${!patterns[@]}"
    do
        line="${line//$key/${patterns[$key]}}"
    done

    echo "$line"
done < master.txt
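
To see the substitution step in isolation, here is a minimal sketch under a few assumptions: the function name replace_all and the two sample patterns are made up for illustration, and the key is quoted inside the expansion (an addition not in the script above) so that bash treats it as literal text rather than as a glob pattern.

```shell
#!/usr/bin/env bash
# Sketch only: replace_all and the sample patterns are hypothetical.
# Requires Bash 4 or later for associative arrays.
declare -A patterns=( [europe]=EU [china]=CN )

replace_all() {
    local text=$1 key
    for key in "${!patterns[@]}"; do
        # Quoting "$key" makes bash match it literally, not as a glob
        text=${text//"$key"/${patterns[$key]}}
    done
    printf '%s\n' "$text"
}

result=$(replace_all 'nice weather in europe today')
printf '%s\n' "$result"
```

Because everything happens inside the shell, no child process is started per replacement, which is the point the answer makes about speed.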

The first script cannot restrict the edit to particular fields. This one can:
# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns

while IFS='|' read -r key value
do
    patterns[$key]="$value"
done < pattern.txt

# Set the option for case insensitive patterns
shopt -s nocasematch

# IFS is set globally here because a per-command IFS assignment
# does not affect how echo joins "${line[*]}"
oldIFS="$IFS"
IFS='|'

# "line" is an array, one element per |-separated field
while read -r -a line
do
    # Check there are at least 15 fields
    if (( ${#line[@]} >= 15 ))
    then
        # Iterate through the patterns array
        for key in "${!patterns[@]}"
        do
            # We are only interested in the 10th and 15th fields
            # (index 9 and 14 since arrays index from zero)
            val="${line[9]}"
            line[9]="${val//$key/${patterns[$key]}}"
            val="${line[14]}"
            line[14]="${val//$key/${patterns[$key]}}"
        done
    fi
    # "${line[*]}" joins the fields with the first character of IFS
    echo "${line[*]}"
done < master.txt

IFS="$oldIFS"
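
Since the question is also tagged awk, the same field-restricted replacement can be sketched as a single awk pass, which may be worth trying on a 100 MB file. This is an illustration under stated assumptions, not a measured result: the temporary file names and sample rows are invented, matching is case-sensitive, and gsub treats each key as a regular expression (and & in a replacement as the matched text), so keys with special characters would need escaping.

```shell
#!/usr/bin/env bash
# Sketch only: file names and sample rows are made up for illustration.
workdir=$(mktemp -d)
printf '%s\n' 'europe|EU' 'china|CN' > "$workdir/pattern.txt"
printf '%s\n' \
    'a|b|c|d|e|europe|g|h|i|weather in europe|k|l|m|n|made in china|p' \
    'short|line|with europe' > "$workdir/master.txt"

result=$(awk -F'|' -v OFS='|' '
    # First file: load key -> value pairs, then skip to the next line
    NR == FNR { map[$1] = $2; next }
    # Second file: touch only fields 10 and 15 of sufficiently long rows
    NF >= 15 {
        for (key in map) {
            gsub(key, map[key], $10)
            gsub(key, map[key], $15)
        }
    }
    { print }
' "$workdir/pattern.txt" "$workdir/master.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

In the sample, the word europe in field 6 is left alone, and the row with fewer than 15 fields is printed unchanged.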

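
Here is a sed alternative, based on the fact that sed can read its commands from a file.
First, I create a sed command file from the contents of the pattern file: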
$ cat file1
europe|EU
australia|AU
china|CN

$ while IFS="|" read -r a b;do 
> echo -e "s/((.[^|]*.){9})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> echo -e "s/((.[^|]*.){14})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g";
> done<file1 >file11

$ cat file11
s/((.[^|]*.){9})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){14})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g
s/((.[^|]*.){9})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){14})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g
s/((.[^|]*.){9})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g
s/((.[^|]*.){14})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g

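I filled file2 with various test values and made sure that the sed regex replaces only in the 10th and 15th fields, and only on an exact word match (that is, the word europe is replaced by EU, but the word european is not).
The results look quite good. I expect this sed solution to process your large file very quickly.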
$ sed -E -f file11 file2
1|2|3|4|5|europe|7|8|9|nice weather in EU today|11|12|europe|14|nice weather in EU today|16
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|nice weather in CN today|16
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|best of chinas today|16
1|2|3|4|5|europe|7|8|9|nice weather in AU today|11|12|australia|14|nice weather in AU today|16
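
The whole-word behaviour can be checked on its own with a much smaller command file. The sketch below deliberately drops the field restriction to isolate two mechanisms: generating the command file with printf, and the \< and \> word boundaries. It assumes GNU sed, since those boundaries are a GNU extension, and the file names and sample lines are invented.

```shell
#!/usr/bin/env bash
# Sketch only: file names and sample data are invented for illustration.
# \< and \> are GNU sed word-boundary extensions, so GNU sed is assumed.
workdir=$(mktemp -d)
printf '%s\n' 'europe|EU' 'china|CN' > "$workdir/pattern.txt"

# Turn each "from|to" pair into one whole-word substitution command
while IFS='|' read -r from to; do
    printf 's/\\<%s\\>/%s/g\n' "$from" "$to"
done < "$workdir/pattern.txt" > "$workdir/replace.sed"

# Apply every generated command in a single pass over the input
result=$(printf '%s\n' 'nice weather in europe today' \
                       'nice european weather today' |
         sed -f "$workdir/replace.sed")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Using printf with a fixed format string avoids depending on how echo -e treats backslashes, which differs between shells.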


Here is a shot in the dark, since your sample data does not even have 10 fields and I do not have time to create tes