Python: find duplicates based only on specific keys/values
I'm trying to use Python to flag duplicate objects in JSON, comparing only the "price" and "full address" key/value pairs and ignoring "url". Each duplicate should then get a new "duplicate" key whose value is its running number (1, 2, and so on). What is the best way to do this?
Current:
A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]
Expected result:
A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
        "duplicate": 1
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
        "duplicate": 2
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]
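Keep a running count for each ("price", "full address") combination, then make a second pass to delete the key from any item that turned out not to be a duplicate: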
from collections import defaultdict

A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]

counts = defaultdict(int)

# First pass: number every item by how often its key tuple has been seen.
for d in A:
    k = (d["price"], d["full address"])
    counts[k] += 1
    d["duplicate"] = counts[k]

# Second pass: remove the marker from items whose key tuple occurred once.
for d in A:
    if counts[(d["price"], d["full address"])] == 1:
        del d["duplicate"]

print(A)
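The same two-pass idea can be wrapped in a function so the compared fields are not hard-coded. This is only a sketch: the name `mark_duplicates` and its `keys` parameter are hypothetical and not part of the answer above, and it assumes every record contains all of the compared fields with hashable values.

```python
from collections import defaultdict


def mark_duplicates(records, keys=("price", "full address")):
    """Number records that share the same values for `keys`, in place.

    Hypothetical helper: the name and signature are assumptions.
    """
    counts = defaultdict(int)
    # First pass: running count per key tuple.
    for record in records:
        k = tuple(record[name] for name in keys)
        counts[k] += 1
        record["duplicate"] = counts[k]
    # Second pass: drop the marker where the key tuple occurred only once.
    for record in records:
        k = tuple(record[name] for name in keys)
        if counts[k] == 1:
            del record["duplicate"]
    return records


listings = [
    {"url": "google.com", "price": 550, "full address": "123 sesame st"},
    {"url": "yahoo.com", "price": 550, "full address": "123 sesame st"},
    {"url": "bing.com", "price": 250, "full address": "123 50th st"},
]
mark_duplicates(listings)
print([d.get("duplicate") for d in listings])  # [1, 2, None]
```

Because the function mutates the dictionaries it is given, pass in a copy if the original list must stay untouched.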
Optimized answer:
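Instead of making a second pass to delete the key from non-duplicates, add the duplicate key only once a combination has occurred more than once. That way the list of dictionaries is iterated only once.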
from collections import defaultdict

A = [
    {
        "url": "google.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "yahoo.com",
        "price": 550,
        "full address": "123 sesame st",
    },
    {
        "url": "bing.com",
        "price": 250,
        "full address": "123 50th st",
    },
]

# Per key tuple: how often it has been seen and the index where it first appeared.
counts = defaultdict(dict)

for index, d in enumerate(A):
    k = (d["price"], d["full address"])
    counts[k]["count"] = counts[k].get("count", 0) + 1
    if counts[k]["count"] == 1:
        counts[k]["first_occurrence"] = index
    else:
        # A repeat: label the first occurrence retroactively, then this item.
        A[counts[k]["first_occurrence"]]["duplicate"] = 1
        d["duplicate"] = counts[k]["count"]

print(A)
Output:
[{'full address': '123 sesame st', 'duplicate': 1, 'price': 550, 'url': 'google.com'}, {'full address': '123 sesame st', 'duplicate': 2, 'price': 550, 'url': 'yahoo.com'}, {'full address': '123 50th st', 'price': 250, 'url': 'bing.com'}]
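Since the question mentions JSON, here is a minimal sketch of the single-pass approach applied to JSON text rather than an existing Python list. The function name `tag_duplicates_json` and the `keys` parameter are hypothetical, and the sketch assumes the input is a JSON array of objects whose compared fields are numbers or strings (so they can be used in a tuple key).

```python
import json


def tag_duplicates_json(text, keys=("price", "full address")):
    """Parse a JSON array of objects, tag duplicates, return JSON text.

    Hypothetical helper: the name and signature are assumptions.
    """
    records = json.loads(text)
    seen = {}  # key tuple -> [count so far, index of first occurrence]
    for index, record in enumerate(records):
        k = tuple(record[name] for name in keys)
        if k not in seen:
            seen[k] = [1, index]
            continue
        seen[k][0] += 1
        # Label the first occurrence retroactively, then the current record.
        records[seen[k][1]]["duplicate"] = 1
        record["duplicate"] = seen[k][0]
    return json.dumps(records, indent=2)


raw = """[
  {"url": "google.com", "price": 550, "full address": "123 sesame st"},
  {"url": "yahoo.com", "price": 550, "full address": "123 sesame st"},
  {"url": "bing.com", "price": 250, "full address": "123 50th st"}
]"""
tagged = json.loads(tag_duplicates_json(raw))
print([d.get("duplicate") for d in tagged])  # [1, 2, None]
```

Records that never repeat are left without a "duplicate" key, which matches the expected result in the question.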