Eliminating redundancy in a file with Python


How can I condense the following data, i.e. eliminate its redundancy?

code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20

The output should look like this:

GB-ENG, 3521
RO-B, 9
DE-NW, 4
DE-BW, 3
DE-HH, 34
DE-BY, 20
BE-BRU, 27

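Each code should appear exactly once in its canonical form, e.g. DE-BY, together with the sum of the numbers associated with every instance of that code. For example: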

code: DE-BY, jobs: 11
code: DE-BY, jobs: 9

becomes

DE-BY, 20

Currently, I create the input with the following Python script:

import json
import requests
from collections import defaultdict
from pprint import pprint

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

# open up the output of 'data-processing.py'
with open('job-numbers-by-location.txt') as data_file:

    # print the output to a file
    with open('phase_ii_output.txt', 'w') as output_file_:
        for line in data_file:
            identifier, name, coords, number_of_jobs = line.split("|")
            coords = coords[1:-1]
            lat, lng = coords.split(",")
            # print("lat: " + lat, "lng: " + lng)
            response = requests.get("http://api.geonames.org/countrySubdivisionJSON?lat="+lat+"&lng="+lng+"&username=s.matthew.english").json()


            codes = response.get('codes', [])
            for code in codes:
                if code.get('type') == 'ISO3166-2':
                    country_code = '{}-{}'.format(response.get('countryCode', 'UNKNOWN'), code.get('code', 'UNKNOWN'))
                    if not hasNumbers( country_code ):
                        # print("code: " + country_code + ", jobs: " + number_of_jobs)
                        output_file_.write("code: " + country_code + ", jobs: " + number_of_jobs)
    # no explicit close() is needed: each "with" block closes its file on exit

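It would probably be most efficient to make this functionality part of the script itself, but I have not been able to work out how to do that.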

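The code below uses the dict.get() method, which your current code already uses, to implement a counter. It reads the values from your current .txt file, but you could just as well skip writing that file and tally the values directly with a similar approach.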

tally = {}

with open('country_codes.txt', 'r') as infile, open('condensed.txt', 'w') as outfile:
    for line in infile:
        # each line looks like "code: GB-ENG, jobs: 2673"
        data = line.strip('\n')
        tag1, code, tag2, num = data.split()
        # "code" keeps its trailing comma (e.g. "GB-ENG,"), which is reused in the output
        tally[code] = tally.get(code, 0) + int(num)
    for key, value in tally.items(): # Use .iteritems() for Python 2.x
        outfile.write(' '.join(map(str, [key, value, '\n'])))
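To skip the intermediate file entirely, the totals can be accumulated while the records are being produced. The following is only a minimal sketch: tally_jobs and the sample records are hypothetical names that stand in for the values computed inside the geonames loop, and defaultdict (already imported in the question's script) is used here in place of dict.get().

```python
from collections import defaultdict


def tally_jobs(records):
    """Sum the job counts per code; records is an iterable of (code, jobs) pairs."""
    totals = defaultdict(int)
    for code, jobs in records:
        # int() ignores surrounding whitespace, so "9\n" is accepted as well
        totals[code] += int(jobs)
    return dict(totals)


# Hypothetical sample standing in for (country_code, number_of_jobs) pairs
records = [("DE-BY", "9\n"), ("DE-HH", "34\n"), ("DE-BY", "11\n")]
totals = tally_jobs(records)

for code, jobs in totals.items():
    print("{}, {}".format(code, jobs))
```

In the question's script this would roughly mean replacing the output_file_.write(...) call with an update of the totals and writing the summed values once the loop has finished.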

This takes a file (country_codes.txt) with the following structure:

code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20

and writes it to condensed.txt as follows:

DE-BY, 20 
DE-HH, 34 
DE-BW, 3 
DE-NW, 4 
RO-B, 9 
GB-ENG, 3521 
BE-BRU, 27
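The lines above come out in a different order from the output requested in the question. If the first-seen order matters, one possible variation is sketched below. It assumes Python 3.7 or later, where a plain dict keeps its keys in insertion order; the condense function and the regular expression are illustrative names, not part of the original answer. The pattern also drops the trailing comma from the code and skips blank or malformed lines.

```python
import re

# Matches lines such as "code: GB-ENG, jobs: 2673"
LINE_RE = re.compile(r"code:\s*(?P<code>[^,\s]+),\s*jobs:\s*(?P<jobs>\d+)")


def condense(lines):
    """Return {code: total_jobs}, keeping codes in first-seen order."""
    totals = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip blank or malformed lines
        code = match.group("code")
        totals[code] = totals.get(code, 0) + int(match.group("jobs"))
    return totals


sample = [
    "code: GB-ENG, jobs: 2673",
    "code: GB-ENG, jobs: 23",
    "code: RO-B, jobs: 9",
    "code: DE-BY, jobs: 9",
    "code: DE-BY, jobs: 11",
    "",
]
result = condense(sample)
output_lines = ["{}, {}".format(code, total) for code, total in result.items()]
print("\n".join(output_lines))
```

To process the real file, the same function can be called with the open file object, e.g. condense(infile), since iterating over a file yields its lines.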
