How do I combine two very simple data sets (keyword, numeric value) by summing the values of identical keywords between two files in Python 3.8?


Apologies in advance for not fully understanding how to format things on this site, so the tabs in my examples aren't pretty.

I need to read the contents of two .txt files, each line holding a keyword and an integer value separated by a tab, sum the values of keywords common to both files, and then write the keywords to a new text file ordered by their associated values from highest to lowest.

Say I have two .txt files:

eggs    25
beans    10
peas    30
oranges    15
and

eggs    15
pineapples    45
beans    35
peas    25

My desired output would look something like:

peas    55
beans    45
pineapples    45
eggs    40
oranges    15
If two keywords share the same value, I'd like them ordered alphabetically.


What would be the most efficient way to do this?

First you have to read the files, then convert them into DataFrames and add them together.

First text file, second txt file: the two files are loaded as DataFrames df1 and df2, each with columns items and amount. Now you can simply add the data:
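The answer does not show how df1 and df2 are created; a minimal sketch, assuming the input files are named f1.txt and f2.txt (hypothetical names) and are read with pandas as tab-delimited, two-column data:

import pandas as pd

# Hypothetical file names; adjust to the real input files
df1 = pd.read_csv('f1.txt', sep='\t', header=None, names=['items', 'amount'])
df2 = pd.read_csv('f2.txt', sep='\t', header=None, names=['items', 'amount'])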

items = {}
for index1, i in enumerate(df1['items']):
    for index2, j in enumerate(df2['items']):
        if i == j:
            items[i] = int(df1['amount'][index1]) + int(df2['amount'][index2])

## For first DataFrame
for index, i in enumerate(df1['items']):
    if i not in items.keys():
        items[i] = df1['amount'][index]

## For second DataFrame
for index, j in enumerate(df2['items']):
    if j not in items.keys():
        items[j] = df2['amount'][index]

## Finally making the final DataFrame
df = pd.DataFrame(items.values(), index=items.keys()).reset_index()
df.columns = ['items', 'amount']
df
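For comparison, the nested-loop merge above can also be expressed with pandas' own grouping. This is just an alternative sketch, not part of the original answer, and it also applies the ordering asked for in the question (value descending, keyword ascending):

import pandas as pd

# Stack both DataFrames, sum the amounts per keyword, then sort
df = (pd.concat([df1, df2])
        .groupby('items', as_index=False)['amount'].sum()
        .sort_values(['amount', 'items'], ascending=[False, True])
        .reset_index(drop=True))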

Using the csv and Counter modules

Code

import csv
from collections import Counter

with open('f1.txt', 'r') as f1, open('f2.txt', 'r') as f2:
  # shown input has multiple spaces between fields
  reader1 = csv.reader(f1, delimiter=' ', skipinitialspace=True)
  reader2 = csv.reader(f2, delimiter=' ', skipinitialspace=True)

  # Use dictionary comprehension to
  # convert to dictionary
  #    converting second value in each row to int
  d1 = {x[0]:int(x[1]) for x in reader1}
  d2 = {x[0]:int(x[1]) for x in reader2}

# Use Counter to add common keys
cnts = Counter(d1) + Counter(d2)

# Sort by value descending and alphabetical ascending
result = dict(sorted(cnts.items(), key=lambda kv: (-kv[1], kv[0])))
for k, v in result.items():
  print(k, v)
Test

File1.txt

eggs    25
beans    10
peas    30
oranges    15
File2.txt

eggs    15
pineapples    45
beans    35
peas    25
Output

peas 55
beans 45
pineapples 45
eggs 40
oranges 15
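One caveat worth noting (not raised in the original thread): adding two Counter objects drops any key whose summed count is zero or negative, so this approach assumes all values are positive. If zero or negative totals must be preserved, summing into a Counter explicitly is safer; a minimal sketch with made-up data:

from collections import Counter

d1 = {'eggs': 25, 'returns': -5}
d2 = {'eggs': 15, 'returns': -3}

print(Counter(d1) + Counter(d2))   # Counter({'eggs': 40}) - 'returns' is dropped

# Summing explicitly keeps every key regardless of sign
totals = Counter()
for d in (d1, d2):
    for k, v in d.items():
        totals[k] += v
print(dict(totals))                # {'eggs': 40, 'returns': -8}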
Update

Updated the code example based on the actual data.

Issues

  • The posted code split fields on multiple spaces
  • The actual data is tab-delimited
  • Many lines in the actual data (from the comments) are not properly formatted into two-column fields
  • Added a function that walks the data line by line and keeps only the valid rows
  • Used the data linked in the comments for files 1 and 2
Comments

  • Thank you for the answer. Could you elaborate a bit more on what this code does, specifically the enumerate() function?
  • enumerate() is basically used to get the index along with the value on each iteration. I then build a dictionary: first I find the duplicate keywords, sum their values, and add them to the dictionary; then I check the non-duplicate values from both DataFrames and add those to the dictionary as well.
  • How would I export this data to a .txt file with the same tab-delimited layout?
  • Use numpy.savetxt(). See this link.
  • This solution only seems to list the data from df1.
  • How can I use non-ASCII characters in the files with this method? I've opened both files as UTF-8, but the program still returns a UnicodeDecodeError at the line d1 = {x[0]: int(x[1]) for x in reader1}.
  • @RandallIvanCarson, could you provide a link to a sample of your (non-ASCII) data, or update the data in f1.txt and f2.txt?
  • The files I'm merging are word-usage-frequency reports for Japanese from two different sources, so I was hesitant to post links even though they are freely available, but I'll provide them anyway.
  • @RandallIvanCarson, see my updated answer (basically an additional section with the update).
  • Unfortunately, this solution also returns a UnicodeDecodeError. Perhaps there is a difference between our operating systems or their settings.
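Regarding the export question in the comments, one way to write the merged result back out with the same tab-delimited layout is csv.writer rather than numpy.savetxt; a sketch, assuming the sorted dict named result from the answer above and a hypothetical output file combined.txt:

import csv

# 'result' is the sorted dict built above; 'combined.txt' is a hypothetical output name
with open('combined.txt', 'w', newline='', encoding='utf8') as out:
    writer = csv.writer(out, delimiter='\t')
    for keyword, total in result.items():
        writer.writerow([keyword, total])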
Code Update

from collections import Counter

def get_data(filenm):
  '''
    Two column CSV tab delimited data
    valid - lines with valid data
    invalid - lines with invalid data (line number, data)
  '''
  valid, invalid = [], []
  with open(filenm, 'r', encoding="utf8") as f:
    for i, line in enumerate(f):
      row = line.rstrip().split('\t')
      if len(row) == 2:
        valid.append(row)  # Valid row of data
      else:
        invalid.append((i, line))

  return valid, invalid

valid1, invalid1 = get_data('agg1.txt')
valid2, invalid2 = get_data('agg2.txt')

# Convert Valid rows to dictionary
d1 = {x[0]:int(x[1]) for x in valid1}
d2 = {x[0]:int(x[1]) for x in valid2}

cnts = Counter(d1) + Counter(d2)
# Sort by value descending and alphabetical ascending
result = dict(sorted(cnts.items(), key=lambda kv: (-kv[1], kv[0])))

# Show first 10 lines of results
print('First 10 lines of results')
for i, (k, v) in enumerate(result.items()):
  print(k, v)
  if i > 10:
    break

# Show invalid data (line number and line)
print()
print('Invalid file 1')
print(*invalid1, sep = '\n')
print('Invalid file 2')
print(*invalid2, sep = '\n')
Output Update

First 10 lines of results
。 6397586
を 4450628
《 2948712
》 2948688
「 2295146
」 2294570
… 1843528
だ 1530958
いる 841602
こと 761052
? 545826
する 458792

Invalid file 1
(5828, '\t\t\t946\n')
(24158, '133\n')
(24293, '132\n')
(30648, '87\n')
(37889, '58\n')
(46807, '37\n')
(51404, '\t\t\t30\n')
(53151, '27\n')
(54272, '26\n')
(54677, '25\n')
(55962, '24\n')
(57129, '23\n')
(70327, '13\n')
(71287, '12\n')
(73405, '11\n')
(76059, '10\n')
(76214, '10\n')
(82563, '8\n')
(83460, '8\n')
(85801, '7\n')
(88476, '6\n')
(88494, '6\n')
(94354, '5\n')
(94703, '5\n')
(97635, '4\n')
(110152, '3\n')
(110153, '3\n')
(110560, '3\n')
(111046, '3\n')
(117778, '2\n')
(117791, '2\n')
(117795, '\t\uf8f3\t2\n')
(117806, '2\n')
(118312, '2\n')
(119811, '2\n')
(119848, '2\n')
(134106, '1\n')
(134485, '1\n')
(134505, '1\n')
(136092, '1\n')
(136144, '1\n')
(136147, '1\n')
(139521, '1\n')
(139626, '1\n')
(139629, '1\n')
(139645, '1\n')
(139665, '1\n')
(139724, '1\n')
(139877, '1\n')
(139885, '1\n')
(139887, '1\n')
(139897, '1\n')
(139914, '1\n')
(139935, '1\n')
(139936, '1\n')
(139963, '1\n')
(139975, '1\n')
Invalid file 2
(5828, '\t\t\t946\n')
(24158, '133\n')
(24293, '132\n')
(30648, '87\n')
(37889, '58\n')
(46807, '37\n')
(51404, '\t\t\t30\n')
(53151, '27\n')
(54272, '26\n')
(54677, '25\n')
(55962, '24\n')
(57129, '23\n')
(70327, '13\n')
(71287, '12\n')
(73405, '11\n')
(76059, '10\n')
(76214, '10\n')
(82563, '8\n')
(83460, '8\n')
(85801, '7\n')
(88476, '6\n')
(88494, '6\n')
(94354, '5\n')
(94703, '5\n')
(97635, '4\n')
(110152, '3\n')
(110153, '3\n')
(110560, '3\n')
(111046, '3\n')
(117778, '2\n')
(117791, '2\n')
(117795, '\t\uf8f3\t2\n')
(117806, '2\n')
(118312, '2\n')
(119811, '2\n')
(119848, '2\n')
(134106, '1\n')
(134485, '1\n')
(134505, '1\n')
(136092, '1\n')
(136144, '1\n')
(136147, '1\n')
(139521, '1\n')
(139626, '1\n')
(139629, '1\n')
(139645, '1\n')
(139665, '1\n')
(139724, '1\n')
(139877, '1\n')
(139885, '1\n')
(139887, '1\n')
(139897, '1\n')
(139914, '1\n')
(139935, '1\n')
(139936, '1\n')
(139963, '1\n')
(139975, '1\n')
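On the UnicodeDecodeError discussed in the comments: the updated code already opens the files with encoding="utf8", so if the error persists, the source files may simply not be UTF-8 (Japanese frequency lists are often Shift_JIS or EUC-JP). A hedged diagnostic sketch, assuming nothing about the real files beyond the file name used above:

# Try a few likely encodings and report the first one that can read the whole file
def detect_encoding(filenm, candidates=('utf-8', 'utf-8-sig', 'shift_jis', 'euc_jp')):
    for enc in candidates:
        try:
            with open(filenm, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(detect_encoding('agg1.txt'))  # e.g. 'shift_jis' if the file is not UTF-8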