How can I find the differences between two large files in Tcl?

file file-io tcl


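I have two files, some of whose contents may be common to both (say, A.txt and B.txt). Both files are sorted. I need the difference between A.txt and B.txt, i.e., a file C.txt that contains everything in A except the lines common to both files.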

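I used the typical search-and-print algorithm: take a line from A.txt and search for it in B.txt; if it is found, print nothing to C.txt, otherwise print the line to C.txt. However, the files I am dealing with are huge, so it throws the error: failed to load too many files. (It works fine for smaller files, though.)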

Can anyone suggest a more efficient way of producing C.txt?

Script to be used: Tcl only.

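First, the too many files error indicates that you are not closing a channel, probably in the B.txt scanner. Fixing that is probably your first goal. If you have Tcl 8.6, try this helper procedure: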

proc scanForLine {searchLine filename} {
    set f [open $filename]
    try {
        while {[gets $f line] >= 0} {
            if {$line eq $searchLine} {
                return true
            }
        }
        return false
    } finally {
        close $f
    }
}

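However, if one of the files is small enough to fit reasonably in memory, you are much better off reading it into a hash table (e.g., a dictionary or an array).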
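A minimal sketch of that in-memory approach, assuming Tcl 8.6 (for try/finally) and assuming that B.txt is the file that fits in memory. The procedure name subtractSmallFile and its parameters are hypothetical, chosen only for illustration:

```tcl
# Hypothetical helper: write to outFile every line of fileA that does
# not occur anywhere in fileB.
# ASSUMPTION: all lines of fileB fit in memory at once.
proc subtractSmallFile {fileA fileB outFile} {
    # Load every line of the smaller file as a dictionary key
    set seen [dict create]
    set f [open $fileB]
    try {
        while {[gets $f line] >= 0} {
            dict set seen $line 1
        }
    } finally {
        close $f
    }

    # Stream the large file, keeping only lines that were not seen in fileB
    set in [open $fileA]
    set out [open $outFile w]
    try {
        while {[gets $in line] >= 0} {
            if {![dict exists $seen $line]} {
                puts $out $line
            }
        }
    } finally {
        close $in
        close $out
    }
}
```

It could be called as subtractSmallFile A.txt B.txt C.txt. Each file is opened exactly once and every channel is closed, so the number of open channels stays constant no matter how large A.txt is.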

That is much more efficient, but it depends on B.txt being small enough.

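If both A.txt and B.txt are too large for that, you are probably best off processing in stages, writing the intermediate results to disk between the stages. This is getting rather more complex: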

set filter [open B.txt]
set fromFile A.txt

for {set tmp 0} {![eof $filter]} {incr tmp} {
    # Filter by a million lines at a time; that'll probably fit OK
    for {set i 0} {$i < 1000000} {incr i} {
        if {[gets $filter line] < 0} break
        set B($line) "dummy"
    }

    # Do the filtering
    if {$tmp} {set fromFile $toFile}
    set from [open $fromFile]
    set to [open [set toFile /tmp/[pid]_$tmp.txt] w]
    while {[gets $from line] >= 0} {
        if {![info exists B($line)]} {
            puts $to $line
        }
    }
    close $from
    close $to

    # Keep control of temporary files and data
    if {$tmp} {file delete $fromFile}
    # -nocomplain: B does not exist if the last chunk read no lines
    unset -nocomplain B
}
close $filter
file rename $toFile C.txt

WARNING! I have not tested this code…

Too bad about your "Tcl only" restriction: this is exactly what it's for.
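Because the question states that both files are already sorted, a single merge-style pass is another pure-Tcl option that keeps only one line of each file in memory. The following is only a sketch, assuming Tcl 8.6 and assuming the files are sorted in the same order that string compare uses, which may not hold for files sorted under a different locale or collation. The procedure name sortedDifference is hypothetical:

```tcl
# Hypothetical helper: write to outFile the lines of sorted fileA that
# do not occur in sorted fileB, reading both files in a single pass.
# ASSUMPTION: both files are sorted in [string compare] order.
proc sortedDifference {fileA fileB outFile} {
    set a [open $fileA]
    set b [open $fileB]
    set out [open $outFile w]
    try {
        # haveB is false once fileB has been exhausted
        set haveB [expr {[gets $b lineB] >= 0}]
        while {[gets $a lineA] >= 0} {
            # Skip lines of B that sort before the current line of A
            while {$haveB && [string compare $lineB $lineA] < 0} {
                set haveB [expr {[gets $b lineB] >= 0}]
            }
            # Keep the line of A unless B holds the same line here
            if {!$haveB || $lineB ne $lineA} {
                puts $out $lineA
            }
        }
    } finally {
        close $a
        close $b
        close $out
    }
}
```

It could be called as sortedDifference A.txt B.txt C.txt. If the sort order of the files differs from what string compare expects, matching lines can be missed, so the chunked filter above is the safer choice in that case.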
