Replace only within XML tags: exporting from Referencer .reflib to BibTeX with bash commands, keeping file names but removing URL encoding

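I have a lot of references. When exporting from Referencer, I am trying to include the file names in the BibTeX file. Since the software does not do this by default, I am trying to use a sed command to add the file name to the XML file as BibTeX information before exporting, so that it ends up in the export.

Input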

  <doc>
<filename>file:///home/dwickrama/Desktop/stevenJonesLab/papers/Transcription%20Factor%20Binding/A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</filename>
<relative_filename>A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</relative_filename>
<key>Sadowski93</key>
<notes></notes>
<bib_type>article</bib_type>
<bib_doi></bib_doi>
<bib_title>A common nuclear signal transduction pathway activated by growth factor and cytokine receptors.</bib_title>
<bib_authors>Sadowski, H B and Shuai, K and Darnell, J E and Gilman, M Z</bib_authors>
<bib_journal>Science</bib_journal>
<bib_volume>261</bib_volume>
<bib_number>5129</bib_number>
<bib_pages>1739-44</bib_pages>
<bib_year>1993</bib_year>
<bib_extra key="pmid">8397445</bib_extra>

Output

  <doc>
<filename>file:///home/dwickrama/Desktop/stevenJonesLab/papers/Transcription%20Factor%20Binding/A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</filename>
<bib_extra key="File">article:../Transcription\ Factor\ Binding/A\ Common\ Nuclear\ Signal\ Transduction\ Pathway\ Activated\ by\ Growth\ Factor\ and\ Cytokine.pdf:pdf</bib_extra>
<relative_filename>A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</relative_filename>
<key>Sadowski93</key>
<notes></notes>
<bib_type>article</bib_type>
<bib_doi></bib_doi>
<bib_title>A common nuclear signal transduction pathway activated by growth factor and cytokine receptors.</bib_title>
<bib_authors>Sadowski, H B and Shuai, K and Darnell, J E and Gilman, M Z</bib_authors>
<bib_journal>Science</bib_journal>
<bib_volume>261</bib_volume>
<bib_number>5129</bib_number>
<bib_pages>1739-44</bib_pages>
<bib_year>1993</bib_year>
<bib_extra key="pmid">8397445</bib_extra>

The sed command below partially does what I want, but the URL encoding "%20" is still there. How can I remove it in the bibtex tag only?

sed -e 's/\(\ \ \ \ <filename>file:\/\/\/home\/dwickrama\/Desktop\/stevenJonesLab\/papers\)\([^.]*\)\(\.\?\)\(.*\)\(<\/filename>\)/\1\2\3\4\5\n\ \ \ \ <bib_extra\ key=\"File\">article:\.\.\2\3\4:\4<\/bib_extra>/g' NewPapers.reflib > NewPapers.new.reflib

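Regex and sed are not good tools for processing XML, or for URL decoding.

A quick script in a more complete scripting language would do this more cleanly and more reliably. For example, in Python: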

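from urllib.parse import urlparse, unquote
from xml.dom import minidom

doc = minidom.parse('NewPapers.reflib')
# Only the first <filename> element is handled here.
el = doc.getElementsByTagName('filename')[0]
# Take the path part of the file:// URL and decode its last two components.
path = urlparse(el.firstChild.data).path
foldername, filename = map(unquote, path.split('/')[-2:])

# Build <bib_extra key="File">...</bib_extra> and place it right after <filename>.
extra = doc.createElement('bib_extra')
extra.setAttribute('key', 'File')
extra.appendChild(doc.createTextNode('article:../%s/%s:pdf' % (foldername, filename)))
el.parentNode.insertBefore(extra, el.nextSibling)
with open('NewPapers.new.reflib', 'w', encoding='utf-8') as f:
    doc.writexml(f)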

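(I haven't included a function to reproduce the backslash escaping shown in the example output, because it isn't clear what format that is. The simplest approach would be
filename = filename.replace(' ', '\\ ')
but I'm not sure that is correct.)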
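
The same idea can be applied to every <filename> element in a large library rather than only the first. The following is a minimal sketch, not a tested Referencer workflow: the function name add_file_entries is hypothetical, the "article:../folder/name:pdf" layout and the fixed ":pdf" suffix are copied from the example output in the question, and spaces are backslash-escaped on the assumption that this is the escaping wanted.

```python
from urllib.parse import urlparse, unquote
from xml.dom import minidom


def add_file_entries(xml_text):
    """Return xml_text with a <bib_extra key="File"> added after every <filename>.

    Hypothetical helper: the entry layout follows the question's example output.
    """
    doc = minidom.parseString(xml_text)
    for el in doc.getElementsByTagName('filename'):
        if el.firstChild is None:
            # Entry without an attached file: nothing to add.
            continue
        # Path part of the file:// URL, then its last two components, URL-decoded.
        path = urlparse(el.firstChild.data).path
        folder, name = (unquote(part) for part in path.split('/')[-2:])
        # Assumption: only spaces need a backslash escape.
        folder = folder.replace(' ', '\\ ')
        name = name.replace(' ', '\\ ')
        extra = doc.createElement('bib_extra')
        extra.setAttribute('key', 'File')
        extra.appendChild(
            doc.createTextNode('article:../%s/%s:pdf' % (folder, name)))
        # Insert right after <filename>; the original element keeps its %20 encoding.
        el.parentNode.insertBefore(extra, el.nextSibling)
    return doc.toxml()
```

To use it, read the contents of NewPapers.reflib into a string, pass it to the function, and write the returned string to NewPapers.new.reflib.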

You just need to add one line right after the <filename> line, right? So just print it after matching that line.

#!/bin/bash

s='<bib_extra key="File">article:../Transcription\\ Factor\\ Binding/A\\ Common\\ Nuclear\\ Signal\\ Transduction\\ Pathway\\ Activated\\ by\\ Growth\\ Factor\\ and\\ Cytokine.pdf:pdf</bib_extra>'

awk -vstr="$s" '
/<filename>/{
    print
    print str;next
}
{print}' file

Comments:

Thanks for your reply. It is a large XML file, and I want to copy and modify all of the file names, not just this particular one. Also, your example does not modify the line appropriately.

By the way, the command filename = filename.replace(' ', '\\ ') is exactly what I need. I am only backslash-escaping the spaces in the file name.

What about the backslash itself? .replace('\\', '\\\\')?

Hi, thanks for your comment. You are right, the backslash itself must be backslash-escaped. However, backslashes are unlikely to appear in BibTeX titles, and neither my example nor my whole BibTeX file contains any. One character that does need backslash escaping, and that I have realized occurs often, is the colon in titles.
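
The escaping discussed in these comments can be collected into one small helper. This is only a sketch: the name escape_bibtex_path is hypothetical, and the set of escaped characters (backslash, colon, space) is taken from the comments, not from any documented Referencer or BibTeX rule.

```python
def escape_bibtex_path(text):
    """Backslash-escape backslashes, colons and spaces in one path component.

    Hypothetical helper; the character set comes from the discussion above.
    """
    # Escape existing backslashes first, so that the backslashes
    # added for ':' and ' ' below are not doubled afterwards.
    text = text.replace('\\', '\\\\')
    for ch in (':', ' '):
        text = text.replace(ch, '\\' + ch)
    return text
```

The order matters: if spaces were escaped before backslashes, the backslash inserted in front of each space would itself be doubled by the later backslash replacement.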