Eliminating redundancy in a file with Python

How can I condense the following data, i.e. eliminate its redundancy:

code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20
The output should look like this:

GB-ENG, 3521
RO-B, 9
DE-NW, 4
DE-BW, 3
DE-HH, 34
DE-BY, 20
BE-BRU, 27
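That is, each code is described by one canonical representation, e.g. DE-BY, followed by the sum of the numbers associated with every instance of that code. For example: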

code: DE-BY, jobs: 11
code: DE-BY, jobs: 9
becomes

DE-BY, 20
Currently I am creating the input with the following Python script:

import json
import requests
from collections import defaultdict
from pprint import pprint

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

# open up the output of 'data-processing.py'
with open('job-numbers-by-location.txt') as data_file:

    # print the output to a file
    with open('phase_ii_output.txt', 'w') as output_file_:
        for line in data_file:
            identifier, name, coords, number_of_jobs = line.split("|")
            coords = coords[1:-1]
            lat, lng = coords.split(",")
            # print("lat: " + lat, "lng: " + lng)
            response = requests.get("http://api.geonames.org/countrySubdivisionJSON?lat="+lat+"&lng="+lng+"&username=s.matthew.english").json()


            codes = response.get('codes', [])
            for code in codes:
                if code.get('type') == 'ISO3166-2':
                    country_code = '{}-{}'.format(response.get('countryCode', 'UNKNOWN'), code.get('code', 'UNKNOWN'))
                    if not hasNumbers( country_code ):
                        # print("code: " + country_code + ", jobs: " + number_of_jobs)
                        output_file_.write("code: " + country_code + ", jobs: " + number_of_jobs)
    # no explicit close() needed: the inner with-block has already closed output_file_

It would probably be most efficient to do this as part of the script, but I haven't been able to figure out how.

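The code below uses the dict.get() method, which your current code already uses, to implement a counter. It is based on reading the values back from your current .txt file, but you could simply skip writing that file and tally the values in a similar way.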

tally = {}

with open('country_codes.txt', 'r') as infile, open('condensed.txt', 'w') as outfile:
    for line in infile:
        data = line.strip('\n')
        tag1, code, tag2, num = data.split()
        tally[code] = tally.get(code, 0) + int(num)
    for key, value in tally.items(): # Use .iteritems() for Python 2.x
        outfile.write(' '.join(map(str, [key, value, '\n'])))
This takes a file with the following structure (country_codes.txt):

code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20
and writes it to condensed.txt as follows:

DE-BY, 20 
DE-HH, 34 
DE-BW, 3 
DE-NW, 4 
RO-B, 9 
GB-ENG, 3521 
BE-BRU, 27
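
One way to skip the intermediate file, as suggested above, might be to keep the tally in memory while the codes are being produced. The sketch below is only an illustration: `tally_jobs` and the `records` list are hypothetical names, and the pairs stand in for the `country_code` and `number_of_jobs` values that the question's loop computes for each line.

```python
def tally_jobs(records):
    """Sum the job counts per code (codes keep first-seen order on Python 3.7+)."""
    tally = {}
    for code, jobs in records:
        # dict.get() returns 0 the first time a code is seen
        tally[code] = tally.get(code, 0) + int(jobs)
    return tally


# Hypothetical stand-in for the (country_code, number_of_jobs) pairs
# produced inside the question's loop; number_of_jobs arrives as text.
records = [
    ("GB-ENG", "2673\n"),
    ("DE-BY", "9\n"),
    ("DE-BY", "11\n"),
    ("GB-ENG", "23\n"),
]

totals = tally_jobs(records)
for code, jobs in totals.items():
    print("{}, {}".format(code, jobs))
```

In the question's script, the same `tally[code] = tally.get(code, 0) + int(jobs)` line could presumably replace the `output_file_.write(...)` call, with the totals written out once the loop has finished.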

You can do it like this:

data = """code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20"""


final_data = {}

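# slice off the literal 'code: ' prefix; str.strip() would remove a set of characters, not a prefix
for code, count in (line[len('code: '):].split(', jobs: ') for line in data.split('\n')):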
    if code in final_data:
        final_data[code]['amount'] += int(count)

    else:
        final_data[code] = {'amount': int(count)}

for key, value in final_data.items():
    print('code: {}, jobs: {}'.format(key, value['amount']))

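import sys, re
from collections import defaultdict

tally = defaultdict(int)
for line in sys.stdin:
    # named groups capture the code and the job count
    match = re.match(r'^code: (?P<code>\S+), jobs: (?P<jobs>\d+)', line)
    if match is None:
        continue  # skip lines that do not fit the expected shape
    tally[match.group('code')] += int(match.group('jobs'))
for code, jobs in tally.items():  # use .iteritems() on Python 2.x
    print("{}, {}".format(code, jobs))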

This assumes your countries.txt is formatted as follows

code: GB-ENG jobs: 2673
code: GB-ENG jobs: 23
code: GB-ENG jobs: 459
code: GB-ENG jobs: 346
code: RO-B jobs: 9
code: DE-NW jobs: 4
code: DE-BW jobs: 3
code: DE-BY jobs: 9
code: DE-HH jobs: 34
code: DE-BY jobs: 11
code: BE-BRU jobs: 27
code: GB-ENG jobs: 20
Code snippet

with open('countries.txt') as input_file, open('phase_ii_output.txt', 'w') as output_file:
    dic = {}
    for line in input_file:
        # fields look like ['code:', 'GB-ENG', 'jobs:', '2673\n']
        fields = line.split(" ")
        key = fields[1]
        num = int(fields[3].rstrip())
        if key in dic:
            dic[key] += num
        else:
            dic[key] = num
    # write() needs a string, so convert the dict first
    output_file.write(str(dic))
Output

{'BE-BRU': 27, 'DE-BY': 20, 'DE-NW': 4, 'DE-BW': 3, 'RO-B': 9, 'GB-ENG': 3521, 'DE-HH': 34}
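
The output above is the dictionary's repr rather than the `CODE, TOTAL` lines asked for in the question. A minimal sketch of formatting such a dictionary one line per code follows; `format_tally` is a hypothetical helper name, and the sample values are copied from the output above.

```python
def format_tally(tally):
    """Return one 'CODE, TOTAL' line per entry, each ending with a newline."""
    return "".join("{}, {}\n".format(code, total) for code, total in tally.items())


# Sample totals copied from the output shown above
dic = {'BE-BRU': 27, 'DE-BY': 20, 'DE-NW': 4}

text = format_tally(dic)
print(text, end="")

# To store the result, write the formatted text instead of the dict:
# with open('phase_ii_output.txt', 'w') as output_file:
#     output_file.write(text)
```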

Assuming the text is stored in a text file, this will work

infile = open('redundancy.txt','r')
a= infile.readlines()
print a
d={}
for item in a:
    c=item.strip('\n')    
    b=c.split()    
    if b[1] in d :
        d[b[1]]= int(d.get(b[1]))+eval((b[3]))
    else:
        d[b[1]]=b[3]
print d
This produces the result:

{'DE-BY,': 20, 'DE-HH,': '34', 'DE-BW,': '3', 'DE-NW,': '4', 'RO-B,': '9', 'GB-ENG,': 3521, 'BE-BRU,': '27'}

You could try using a Python Counter, where the keys are the codes and the values are the number of jobs?

Do you mean not writing the file at all?

What would that look like?

You could probably do this easily with just standard UNIX tools on the command line.

You are free to accept whichever answer you choose, and normally I wouldn't mind. However, you have un-accepted mine and accepted an answer that requires eval, even though it is one of Python's worst features, one that can completely wipe an entire system? I'm interested in your logic... I don't see how it improves on my original answer; it is just newer. It doesn't use a context manager to handle opening the file (in fact, the file is never even closed), and it doesn't write any output, it just gives a dictionary.

So, is this its own file?

I'm not sure what you're asking. I think that at the moment you are writing the uncondensed output to phase_ii_output.txt? With this approach you would have to read the data back in (as I did here) and process it again to get the desired output. But removing the write to phase_ii_output.txt entirely would not take much modification. The idea is just to use tally[code] = tally.get(code, 0) + int(num) to sum up the jobs.

This gives: 'dict' has no attribute 'iterItems'

@s.matthew.english Oh, that was changed in Python 3. Change iteritems() to just items(). I'll edit.
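
As for what the Counter suggestion in the comments might look like, here is a minimal sketch, assuming the input lines have the `code: X, jobs: N` shape shown in the question. The `lines` list is made-up sample data and `LINE_RE` is an illustrative name.

```python
import re
from collections import Counter

# Assumed line shape: "code: GB-ENG, jobs: 2673"
LINE_RE = re.compile(r'code: (\S+), jobs: (\d+)')

lines = [
    "code: GB-ENG, jobs: 2673",
    "code: DE-BY, jobs: 9",
    "code: DE-BY, jobs: 11",
    "code: GB-ENG, jobs: 23",
]

counts = Counter()
for line in lines:
    match = LINE_RE.match(line)
    if match:  # skip lines that do not fit the assumed shape
        code, jobs = match.groups()
        counts[code] += int(jobs)

# most_common() lists the codes from the largest total to the smallest
for code, jobs in counts.most_common():
    print("{}, {}".format(code, jobs))
```

A Counter behaves like a dict whose missing keys count as 0, so it needs neither `dict.get(code, 0)` nor an explicit `if code in ...` check.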