jq or Python script to remove the text after the date in a JSON field

Tags: python, json, jq

I have a JSON file containing hundreds of entries, for example:

{
    "url":"http://example.com/10618/",
    "metatag.eprints.publication":"Journal of Corporate Real Estate",
    "metatag.eprints.title":"Corporate Real Estate Strategy",
    "metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006) Corporate Real Estate Strategy. Journal of Corporate Real Estate"}
{
    "url":"http://example.com/23552/",
    "metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
    "metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
    "metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012) Long-term survival from coronary endarterectomies in coronary artery disease. European Journal of Cardio-Thoracic Surgery"}
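Can anyone help me write a jq or Python script that modifies metatag.eprints.citation in each block so that all text after the date is removed?

So the blocks above would become: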

{
    "url":"http://example.com/10618/",
    "metatag.eprints.publication":"Journal of Corporate Real Estate",
    "metatag.eprints.title":"Corporate Real Estate Strategy",
    "metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006)"}
{
    "url":"http://example.com/23552/",
    "metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
    "metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
    "metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012)"}

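If your file is formatted as in your question, you can use itertools.groupby to group the lines by the opening brace, join each group of lines with str.join, and parse the result with json.loads to get a dict. Then access the value by key and write the updated data to a temporary file. Finally, use shutil.move to replace the original file. If you want a brand-new file instead, change NamedTemporaryFile to a plain open.

in.txt before: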
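The steps above can be sketched as follows. This is a minimal sketch, not the answerer's original code: the function name trim_citations and the constant KEY are hypothetical, and it assumes every record starts with a line holding only "{" and that the first ")" in the citation closes the date.

```python
import json
import os
from itertools import groupby
from shutil import move
from tempfile import NamedTemporaryFile

KEY = "metatag.eprints.citation"  # field to trim


def trim_citations(path):
    """Rewrite `path` as one JSON object per line, trimming each citation."""
    folder = os.path.dirname(os.path.abspath(path))
    with open(path) as f, NamedTemporaryFile("w", dir=folder, delete=False) as out:
        # Lines starting with "{" form one group; the lines after them form the next.
        for is_brace, lines in groupby(f, key=lambda x: x.lstrip().startswith("{")):
            if is_brace:
                continue  # the bare "{" line is re-added below
            d = json.loads("{" + "".join(lines))
            v = d[KEY]
            # Keep everything up to and including the first ")";
            # fall back to the original value if there is no ")".
            d[KEY] = v[:v.find(")") + 1] or v
            json.dump(d, out)
            out.write("\n")
    # Replace the original file with the rewritten one.
    move(out.name, path)
```

Calling trim_citations("in.txt") would rewrite in.txt in place. The temporary file is created in the same directory as the input so that the final move stays on one filesystem.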

{
    "url":"http://example.com/10618/",
    "metatag.eprints.publication":"Journal of Corporate Real Estate",
    "metatag.eprints.title":"Corporate Real Estate Strategy",
    "metatag.eprints.citation":"Adair, P, McGrogan, WS, and Webb, JR (2006) Corporate Real Estate Strategy. Journal of Corporate Real Estate"}
{
    "url":"http://example.com/23552/",
    "metatag.eprints.publication":"European Journal of Cardio-Thoracic Surgery",
    "metatag.eprints.title":"Long-term survival from coronary endarterectomies in coronary artery disease",
    "metatag.eprints.citation":"Aaron, P, Jones, K, Pallin, C, and Nash, R (2012) Long-term survival from coronary endarterectomies in coronary artery disease. European Journal of Cardio-Thoracic Surgery"}
in.txt after:

{"url": "http://example.com/10618/", "metatag.eprints.publication": "Journal of Corporate Real Estate", "metatag.eprints.citation": "Adair, P, McGrogan, WS, and Webb, JR (2006)", "metatag.eprints.title": "Corporate Real Estate Strategy"}
{"url": "http://example.com/23552/", "metatag.eprints.publication": "European Journal of Cardio-Thoracic Surgery", "metatag.eprints.citation": "Aaron, P, Jones, K, Pallin, C, and Nash, R (2012)", "metatag.eprints.title": "Long-term survival from coronary endarterectomies in coronary artery disease"}
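If you need to edit the file again later, you can simply loop over it, call json.loads on each line to get a dict, update the value by key again, and write it back out. One object per line will make your life easier.

If there can be an opening parenthesis before the date, you can use a regular expression to search for the specific substring, four digits between parentheses: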

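import json, re
from itertools import groupby

r = re.compile(r"\(\d{4}\)")  # a four-digit year in parentheses
# f is the open input file; out is the open output file
for k, v in groupby(f, key=lambda x: x.lstrip().startswith("{")):
    if not k:  # the lines that follow an opening brace
        d = json.loads("{" + "".join(v))
        v = d["metatag.eprints.citation"]
        m = r.search(v)
        if m:  # leave the citation unchanged when no year is found
            d["metatag.eprints.citation"] = v[:m.end()]
        json.dump(d, out)
        out.write("\n")
If you end up with an empty file, your data must actually be one dict per line, so just iterate over the file object and apply the same logic: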
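To see why the regular expression is safer than cutting at the first ")", here is a small standalone sketch. The helper name trim_after_year and the sample citation with an extra "(ed.)" are made up for illustration.

```python
import re

YEAR = re.compile(r"\(\d{4}\)")  # a four-digit year in parentheses


def trim_after_year(citation):
    """Return the citation cut right after the first "(YYYY)", or unchanged."""
    m = YEAR.search(citation)
    return citation[:m.end()] if m else citation


sample = "Adair, P (ed.), and Webb, JR (2006) Corporate (Real) Estate"
print(trim_after_year(sample))
```

Cutting at the first ")" would stop after "(ed.)" here, whereas the year pattern skips any parenthesised text that is not exactly four digits.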

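import json
from shutil import move
from tempfile import NamedTemporaryFile

with open("in.txt") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for line in f:
        d = json.loads(line)
        v = d["metatag.eprints.citation"]
        # keep everything up to and including the first ")"; leave v alone if there is none
        d["metatag.eprints.citation"] = v[:v.find(")") + 1] or v
        json.dump(d, out)
        out.write("\n")
move(out.name, "in.txt")

jq '.["metatag.eprints.citation"] |= (match(".*?\\)").string // .)'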


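This requires jq 1.5. The idea is to set the value of metatag.eprints.citation to the result of matching itself against the regular expression .*?\), which matches everything up to and including the first closing parenthesis. If for any reason there is no closing parenthesis, the alternative operator // sets the value back to the original.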
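For readers whose jq build lacks regex support, the same lazy match with a fallback can be sketched in Python. The function name up_to_first_paren is hypothetical; the pattern is the one from the jq filter.

```python
import re


def up_to_first_paren(value):
    """Mirror the jq filter: keep text through the first ")", else keep value."""
    m = re.search(r".*?\)", value)  # lazy match, stops at the first ")"
    return m.group(0) if m else value  # fallback plays the role of jq's "// ."
```

As with the jq filter, this cuts at the first closing parenthesis, so it assumes no parenthesis appears before the date.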

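Thanks Padraic. This actually removes the metatag.eprints.citation entry, whereas I only want to remove the text after the date. Does this need a regexp?

@KoreMike, ah OK, I will edit. Unless the only parentheses are the ones around the date, you will need a regex?

I actually trimmed the content in the example; there are other parentheses after the date.

So always after the date?

You're right @Padraic Cunningham. One of my text editors had added a tabbed indent throughout the file that I hadn't noticed! Thanks for your code, man.

Thank you, Santiago Lapresta. How do I "cat" this to a new file? I have previously used scripts like this: cat staff-output-solr-index.json | jq -c 'select(.url | contains(...) | not)' > cleaned-staff-index.json

@KoreMike Just pipe the output of the command into a file: cat my-ugly-old-file.json | jq '.["metatag.eprints.citation"] |= (match(".*?\\)").string // .)' > my-fancy-citations.json

Thanks @Victor Bjelkholm. That is exactly what I tried, but I got an "invalid escape" error when parsing "\\". So I removed one of the backslashes and the command ran fine. However, the output file is empty.

Can you test it with a test JSON file containing the data from the first code block in the original question? The number of backslashes needed may depend on the escaping rules of the shell you are using.

@Victor Bjelkholm's answer is correct, and I did try it successfully, using the code block you provided. Did you get an error message?

@Santiago Lapresta I think my problem is caused by the indented code mentioned in Padraic's thread above. It sounds like your solution is correct. I can't test it yet because I need to rebuild jq with oniguruma.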