Python: merge additional column information from a master database for selected records


python python-2.7 dictionary merge


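I am working with two files that each contain millions of records; the test data below is only there to illustrate the problem. tx_match.txt contains all the records, while txid_time.txt holds only a few records with timestamps. My desired output is shown below: the idea is to merge in the additional column information from the master database. Note that I am not allowed to use the pandas library.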

tx_match.txt

col1  col2  col3      col4
171    9    9    5000000000
183    171    9    4000000000
185    183    9    3000000000
187    185    9    2900000000
192    187    187  100000000
227    185    185  100000000
255    187    9    2800000000
504    367    367  5000000000
504    192    192  100000000
504    255    255  1000000000
533    293    293  5000000000
555    533    533  2500000000

txid_time.txt

col1      col2
227     2017-02-10
255     2017-01-10
504     2017-02-09

My expected output is:

227    185     185     100000000   2017-02-10
255    187     9       2800000000  2017-01-10 
504    367     367     5000000000  2017-02-09
504    192     192     100000000   2017-02-09
504    255     255     1000000000  2017-02-09

This is what I have so far:

import csv 
d={}
fin = open("txid_match.txt","r")
for line in fin:
    try:
        line = line.rstrip()
        f = line.split("\t")
        k=f[0]
        v=f[1]
        d[k]=v
    except IndexError:
        continue

fin.close()
#print(d)
fin = open("txid_time.txt","r")
fout = open("txmatch_time.txt",'w')
foutWriter=csv.writer(fout)
for line in fin:
    try:
         line = line.rstrip()
         f = line.split("\t")
         txid=f[0]
         prvtxid=d[txid]    
         foutWriter.writerow([f[0]+"\t"+f[1]+"\t"+prvtxid])
    except IndexError:
         continue
    except KeyError:
         continue
fin.close()    
fout.close()

Thanks in advance for your support.

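Your solution will work. However, it requires linear space even in the best case. The solution below improves on that and needs only constant space in the best case. It relies on both files being sorted by the first column, as they are in your sample data. It also makes better use of automatic context managers (the with statement) and of the automatic parsing and joining done by the reader and writer of the csv package. (Note that, for clarity, I have left out the IndexError and KeyError handling; you may need to add it yourself if you need it.)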
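Are you not allowed to use any libraries at all, or just not pandas?

@Abhishek, I can use basic Python libraries such as csv. Do you have a solution that uses no libraries at all? Looking forward to seeing it, thanks.

Can you edit the .txt files? E.g. I would remove all the spaces, replace them with ; and add a header. That would make the .txt files easier to read.

@J.a.Cado, sorry about the messed-up spacing; I had originally removed some columns. I have now added headers and put 4 spaces between the columns. Thanks.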
import csv

col_delim = '\t'
row_delim = '\n'

with open('txid_time.txt', 'r') as ftime, open('tx_match.txt', 'r') as fmatch, open('txmatch_time.txt', 'w') as fmerge:
    # The keyword is spelled `delimiter`. csv.reader ignores lineterminator,
    # so it is only passed to the writer.
    rtime = csv.reader(ftime, delimiter=col_delim)
    rmatch = csv.reader(fmatch, delimiter=col_delim)
    wmerge = csv.writer(fmerge, delimiter=col_delim, lineterminator=row_delim)

    try:
        # Skip the header row of each file (drop these two lines if your files have no header).
        next(rtime)
        next(rmatch)
        time = next(rtime)
        match = next(rmatch)
        continue_ = True
        while continue_:
            # Compare keys as integers; as raw strings, '99' would sort after '100'.
            while int(time[0]) < int(match[0]):
                time = next(rtime)
            while int(time[0]) > int(match[0]):
                match = next(rmatch)
            if int(time[0]) == int(match[0]):
                key = int(time[0])
                # Collect every row from each file that shares the current key.
                times = []
                try:
                    while int(time[0]) == key:
                        times.append(time)
                        time = next(rtime)
                except StopIteration:
                    continue_ = False
                matches = []
                try:
                    while int(match[0]) == key:
                        matches.append(match)
                        match = next(rmatch)
                except StopIteration:
                    continue_ = False
                # Separate loop names keep the look-ahead rows in `time` and `match` intact.
                for m in matches:
                    for t in times:
                        wmerge.writerow(m + t[1:2])
    except StopIteration:
        pass
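
If the files cannot be guaranteed to be sorted by the first column, a dictionary-based join is a possible alternative. The following is only a minimal sketch, not part of the answer above: it assumes that txid_time.txt is the smaller file and fits in memory, and the function names merge_by_key and merge_files are hypothetical. It indexes the small file by key and then streams the large file once, so memory use grows with the small file only.

```python
import csv
from collections import defaultdict


def merge_by_key(time_rows, match_rows):
    """Yield each match row extended with every timestamp that shares its key.

    Both arguments are iterables of already-split rows (lists of strings)
    keyed on their first column. Only time_rows is held in memory.
    """
    # Index the small file: key -> list of timestamps (a key may repeat).
    times = defaultdict(list)
    for row in time_rows:
        if len(row) >= 2:
            times[row[0]].append(row[1])

    # Stream the large file and emit one merged row per matching timestamp.
    for row in match_rows:
        if not row:
            continue  # ignore blank lines
        for stamp in times.get(row[0], ()):
            yield row + [stamp]


def merge_files(time_path, match_path, out_path, delim='\t'):
    """Hypothetical wrapper: join two delimited files and write the result."""
    with open(time_path) as ftime, open(match_path) as fmatch, \
            open(out_path, 'w') as fmerge:
        writer = csv.writer(fmerge, delimiter=delim, lineterminator='\n')
        writer.writerows(merge_by_key(csv.reader(ftime, delimiter=delim),
                                      csv.reader(fmatch, delimiter=delim)))
```

With the file names from the question, this would be called as merge_files('txid_time.txt', 'tx_match.txt', 'txmatch_time.txt'). Output rows come out in the order of tx_match.txt. Keys are compared as exact strings, so a header row whose first field is identical in both files is carried through like any other row.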