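[Tip] Proximity-based search for multiple occurrences of a word pair (Python)
I have a body of text and two keywords, say k1 and k2. I want to find every instance where k1 and k2 occur within, say, 5 words of each other. I want to store two pieces of information from this search: the number of such matches, and the word positions of the best match. "Best" here means the occurrence in which k1 and k2 are closest to each other, so that I can do further work on that match later.
I have written the code below, but it fails to find a match. It also gives me neither the number of matches nor the word positions.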
import re
text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1 its inhibition by the state of the art in aquaporin 2'
a = 'aquaporin protein-1'
b = 'inhibition'
diff=500
l = re.split(';|,|-| ', text)
l1 = re.split(';|,|-| ', a)
l2 = re.split(';|,|-| ', b)
counts=[m.start() for m in re.finditer(a, text)]
counts1=[m.start() for m in re.finditer(b, text)]
for cc in counts:
    for c1 in counts1:
        if abs(cc-c1) < diff:
            diff = abs(cc-c1)
            values = (cc, c1)
if text.find(a) < text.find(b):
    r= (l.index(l2[0]) - l.index(l1[-1]))
if text.find(a) > text.find(b):
    r= (l.index(l1[0]) - l.index(l2[-1]))
if r<5:
    print('matched')
    print(r)
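One reason the snippet above prints nothing is that `text.find` and `list.index` only report the first occurrence, so only the first `inhibition` (token 4) is ever compared with `aquaporin` (token 10), which gives a gap of 6. A minimal sketch, assuming Python 3, of listing every occurrence instead; the helper name `token_positions` is made up for illustration:

```python
import re

def token_positions(tokens, word):
    """Return the index of every token equal to `word` (hypothetical helper)."""
    return [i for i, tok in enumerate(tokens) if tok == word]

text = ('the flory of gthys inhibition in this proffession by in '
        'aquaporin protein-1 its inhibition by the state of the art in aquaporin 2')
tokens = re.split(r'[;,\- ]', text)  # same separators as the question's split

print(tokens.index('inhibition'))             # first occurrence only
print(token_positions(tokens, 'inhibition'))  # every occurrence
print(token_positions(tokens, 'aquaporin'))
```

Note that the split also breaks `protein-1` at the hyphen, so the phrase `aquaporin protein-1` ends up spread over three tokens.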
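I decided to replace your multi-word keywords in the original text with placeholders. That way phrases can be detected, because once replaced they are no longer broken apart when the string is split on whitespace.
After that, a simple loop over indices and values does the counting and keeps track, in a tuple, of the keyword positions with the smallest distance between them.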
text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1 its inhibition b'
a = 'aquaporin protein-1'
b = 'inhibition'
# Replace each keyword (multi-word ones too) with a single-token placeholder
text = text.replace(a, 'k1')
text = text.replace(b, 'k2')
l = text.split()
#print(l)
#print('k1 -> %s' % a)
#print('k2 -> %s' % b)
last_a = -1  # word index of the most recent k1, -1 if none seen yet
last_b = -1  # word index of the most recent k2, -1 if none seen yet
counts = 0
max_match_tuple = (6,0) # Seeded with a gap of 6, so any match within 5 words replaces it
for k,v in enumerate(l):
    #print(str(k) + '--->' + str(v))
    if v == 'k1':
        last_a = k
        if k - last_b < 6 and last_b != -1:
            counts = counts + 1
            if k - last_b < max_match_tuple[0] - max_match_tuple[1]:
                max_match_tuple = (k, last_b)
    if v == 'k2':
        last_b = k
        if k - last_a < 6 and last_a != -1:
            counts = counts + 1
            if k - last_a < max_match_tuple[0] - max_match_tuple[1]:
                max_match_tuple = (k, last_a) # Careful with the order here since it matters for the subtraction above
print(counts)
print(max_match_tuple)
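The loop above could also be wrapped in a reusable function. This is only a sketch, assuming Python 3 and that the keywords have already been replaced by single-token placeholders; the name `scan_pairs` and the `window` parameter are illustrative and not part of the answer. It tracks the best distance explicitly rather than seeding the tuple with (6, 0).

```python
def scan_pairs(tokens, k1, k2, window=5):
    """Single pass: pair each keyword hit with the latest hit of the other keyword.

    Returns (count, best), where best is (distance, later_index, earlier_index)
    or None when no pair lies within `window` words.
    """
    last = {k1: None, k2: None}  # most recent word index of each keyword
    count, best = 0, None
    for i, tok in enumerate(tokens):
        if tok not in last:
            continue
        other = k2 if tok == k1 else k1
        last[tok] = i
        j = last[other]
        if j is not None and i - j <= window:
            count += 1
            if best is None or i - j < best[0]:
                best = (i - j, i, j)
    return count, best

tokens = 'the flory of gthys k2 in this proffession by in k1 its k2 b'.split()
result = scan_pairs(tokens, 'k1', 'k2')
print(result)
```

Like the original loop, this only pairs a keyword with the most recent occurrence of the other one, so it does not enumerate every possible pair.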
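So here is my own code.
Give it a try.
The advantage is that it gives you a list of tuples (distance between the words, index of keyword 1, index of keyword 2).
The closest match is therefore the first element of data (although several elements may share the same distance).
-------------- EDIT ------------
So, if you want a multi-word keyword to count as a single word (when measuring distance), you have to replace it first. This (untested) code might work: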
input_text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1 its inhibition b , aquaporin protein-1'
a = 'aquaporin protein-1'
b = 'inhibition'
multiwords = ['aquaporin protein-1']
# Join each multi-word keyword with '__' so that split() keeps it as one token.
# Replacements accumulate in text, so several multi-word keywords are all handled.
text = input_text
for mw in multiwords:
    mw_no_space = mw.replace(' ', '__')
    text = text.replace(mw, mw_no_space)
k1 = a.replace(' ', '__')
k2 = b.replace(' ', '__')
l = text.split()
# Word positions of every occurrence of each keyword
d_idx = {k1:[], k2:[]}
for k,v in enumerate(l):
    if v == k1:
        d_idx[k1].append(k)
    elif v == k2:
        d_idx[k2].append(k)
distance = 10
data = []
for idx1 in d_idx[k1]:
    for idx2 in d_idx[k2]:
        d = abs(idx1 - idx2)
        if d<=distance:
            data.append((d,idx1,idx2))
data.sort(key=lambda x: x[0])
print(data)
if data:  # data[0] would raise IndexError when no pair is within range
    print("Least distance:", data[0][0])
    print("Index of kw1 and kw2:", data[0][1:])
print("Number of occurrences:", len(data))
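Since several pairs can share the smallest distance, it may be useful to keep all of them rather than only the first element. A small sketch, assuming Python 3 and the `(distance, idx1, idx2)` tuple layout used above; the sample `data` list is made up for illustration:

```python
data = [(2, 10, 12), (3, 15, 12), (6, 10, 4), (2, 20, 22)]  # illustrative sample

if data:
    least = min(d for d, _, _ in data)
    # Every pair whose distance equals the minimum, not just the first one
    closest = [pair for pair in data if pair[0] == least]
    print(closest)
else:
    closest = []
    print("no pairs within the distance limit")
```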
In theory you could do this with a regular expression, but supporting all the edge cases would get very messy.
The simple form is:
(?Pkey1)\s+(?p(\w+\s+(?!key2)){0,4}\w+\s+(?Pkey2)
Sample data:
word0 key1 key2 word1
word0 key1 word1 word2 key2 word3
word0 key1 word1 word2 word3 key2 word4
word0 key1 word1 word2 word3 word4 key2 word5
word0 key1 word1 word2 word3 word4 word5 key2 word6
word0 key1 word1 word2 word3 word4 word5 word6 key2 word7
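A runnable variant of this idea might look like the sketch below. It assumes Python 3; the function name `within_n_words`, the group name `between` and the `max_between` parameter are chosen here for illustration and are not taken from the answer. It only handles key1 appearing before key2, and only words that `\w+` can match.

```python
import re

def within_n_words(text, key1, key2, max_between=5):
    """Return a match if key2 follows key1 with at most `max_between` words between."""
    pattern = re.compile(
        r'\b' + re.escape(key1) + r'\s+'
        # Up to max_between intervening words, none of which is key2 itself
        + r'(?P<between>(?:(?!' + re.escape(key2) + r'\b)\w+\s+){0,'
        + str(max_between) + r'})'
        + re.escape(key2) + r'\b'
    )
    return pattern.search(text)

samples = [
    'word0 key1 key2 word1',
    'word0 key1 word1 word2 key2 word3',
    'word0 key1 word1 word2 word3 word4 word5 word6 key2 word7',
]
for line in samples:
    m = within_n_words(line, 'key1', 'key2')
    print(line, '->', m.group('between').split() if m else None)
```

Covering the reverse order, hyphenated tokens or punctuation would need further alternations, which is where the pattern starts to get messy.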
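Why did you remove the code you posted? I suggest putting it back.
@Stribizov, I felt it wasn't good enough. I've added it back, though. Thanks.
It may not be, but it provides important information, for example that the keywords may be phrases, and it gives potential answerers a good starting point.
I don't know whose answer to accept. Both are almost equally useful :( I have upvoted both, though.
@Ciitk34 If you really can't choose between them, flip a coin.
Akis, most of my keywords are actually multi-word, so this code may not work. Do you have any ideas for such "phrase" keywords?
In your example, do you want to treat aquaporin protein-1 as two keywords, aquaporin and protein-1?
No. aquaporin protein-1 is one keyword and inhibition is another.
I was thinking of something similar. I like how you handle "phrases" rather than single words, and how you consider both directions of proximity (I was looking for these).
@Ciitk34 I'll make an edit to explain better how it works.
Hey, can you explain this line? d_idx = {k1:[], k2:[]}
I am building a dictionary with the keys k1 and k2 (which will be the keywords you defined) and assigning an empty list to each key (so that I can append data later).
Since I'm new to this, I didn't think of using a dictionary. Can you tell me why it is necessary here instead of arrays?
It isn't necessary to use a dictionary; you could write: if v == k1: list_k1.append(k) elif v == k2: list_k2.append(k)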
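To make the dictionary-versus-lists point concrete, here is a short sketch, assuming Python 3 and a made-up token list, that collects the same positions both ways:

```python
tokens = 'x _KEYWORD_1_ y _KEYWORD_2_ z _KEYWORD_1_'.split()  # illustrative input
k1, k2 = '_KEYWORD_1_', '_KEYWORD_2_'

# Variant 1: two separate lists, as suggested in the comment
list_k1, list_k2 = [], []
for k, v in enumerate(tokens):
    if v == k1:
        list_k1.append(k)
    elif v == k2:
        list_k2.append(k)

# Variant 2: one dictionary keyed by keyword, as in the answer
d_idx = {k1: [], k2: []}
for k, v in enumerate(tokens):
    if v in d_idx:
        d_idx[v].append(k)

print(list_k1, list_k2)
print(d_idx)
```

Both give the same positions. The dictionary mainly pays off when there are more than two keywords, because the loop body no longer needs one branch per keyword.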
text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1 its inhibition b , aquaporin protein-1'
a = 'aquaporin protein-1'
b = 'inhibition'
# Placeholders that stay in one piece when the text is split on whitespace
k1 = "_KEYWORD_1_"
k2 = "_KEYWORD_2_"
text = text.replace(a, k1)
text = text.replace(b, k2)
l = text.split()
# Word positions of every occurrence of each keyword
d_idx = {k1:[], k2:[]}
for k,v in enumerate(l):
    if v == k1:
        d_idx[k1].append(k)
    elif v == k2:
        d_idx[k2].append(k)
distance = 5
data = []
for idx1 in d_idx[k1]:
    for idx2 in d_idx[k2]:
        d = abs(idx1 - idx2)
        if d<=distance:
            data.append((d,idx1,idx2))
data.sort(key=lambda x: x[0])
if data:  # data[0] would raise IndexError when no pair is within range
    print("Least distance:", data[0][0])
    print("Index of kw1 and kw2:", data[0][1:])
print("Number of occurrences:", len(data))