Compare two text files and find related words in Python
I have two text files named search.txt and log.txt, which contain data like the following.
search.txt
19:00:15 , mouse , FALSE
19:00:15 , branded luggage bags and trolley , TRUE
19:00:15 , Leather shoes for men , FALSE
19:00:15 , printers , TRUE
19:00:16 , adidas watches for men , TRUE
19:00:16 , Mobile Charger Stand/Holder black , FALSE
19:00:16 , watches for men , TRUE
log.txt
19:00:00 , trakjkfsa,
19:00:00 , door,
19:00:00 , sweater,
19:00:00 , sweater,
19:00:00 , sweater,
19:00:00 , dis,
19:00:01 , not,
19:00:01 , nokia,
19:00:01 , collar,
19:00:01 , nokia,
19:00:01 , collar,
19:00:01 , gsm,
19:00:01 , sweater,
19:00:01 , sweater,
19:00:01 , gsm,
19:00:02 , gsm,
19:00:02 , show,
19:00:02 , wayfreyerv,
19:00:02 , door,
19:00:02 , collar,
19:00:02 , or,
19:00:02 , harman,
19:00:02 , women's,
19:00:02 , collar,
19:00:02 , sweater,
19:00:02 , head,
19:00:03 , womanw,
19:00:03 , com.shopclues.utils.k@42233ff0,
19:00:03 , samsu,
19:00:03 , adidas,
19:00:03 , collar,
19:00:04 , ambas,
19:00:04 , harman,
19:00:04 , mi,
19:00:04 , nor,
19:00:04 , airtel,
19:00:04 , ,
19:00:04 , adidas,
19:00:05 , harman,
19:00:05 , collar,
19:00:05 , flip,
19:00:05 , brass,
19:00:05 , laptop,
19:00:05 , collar,
19:00:05 , wayfreyer,
19:00:05 , head,
19:00:05 , adidas,
19:00:05 , discn,
19:00:05 , head,
19:00:05 , adidas,
19:00:05 , collar,
19:00:05 , collar,
19:00:06 , disco,
19:00:06 , head,
19:00:06 , harman,
19:00:06 , nigh,
19:00:06 , microsoft,
19:00:06 , ambassado,
19:00:07 , salwar,
19:00:07 , bb,
19:00:07 , harman,
19:00:07 , ambassador,
19:00:07 , ambassador,
19:00:07 , salwar,
19:00:08 , microsoft,
19:00:08 , ac,
19:00:08 , jea,
19:00:08 , gens,
19:00:08 , ambassador,
19:00:08 , orpa,
19:00:09 , ac,
19:00:09 , black,
19:00:09 , asus,
19:00:09 , salwar,
19:00:09 , salwar,
19:00:09 , ac,
19:00:10 , whechains,
19:00:10 , gens,
19:00:10 , ambassador,
19:00:10 , sony,
19:00:10 , salwa,
19:00:10 , ac,
19:00:10 , woman,
19:00:10 , li,
19:00:11 , boxers,
19:00:11 , harman,
19:00:11 , sal,
19:00:11 , ambassador,
19:00:11 , sony,
19:00:11 , ,
19:00:11 , boxers,
19:00:12 , adidas,
19:00:12 , samsung,
19:00:12 , boxer,
19:00:12 , boxers,
19:00:12 , com.shopclues.utils.k@427b9538,
19:00:12 , harman,
19:00:12 , wechains#002,
19:00:12 , collar,
19:00:13 , collar,
19:00:13 , collar,
19:00:13 , one,
19:00:13 , collar,
19:00:13 , ambassador,
19:00:13 , hitech,
19:00:13 , fanc,
19:00:13 , adidas,
19:00:13 , bp,
19:00:13 , asus,
19:00:13 , ambassador,
19:00:13 , harman,
19:00:14 , lin,
19:00:14 , one,
19:00:14 , samsung,
19:00:14 , cond,
19:00:14 , atx,
19:00:15 , blackles#002,
19:00:15 , woman,
19:00:15 , asus,
19:00:15 , airtel,
19:00:15 , weel,
19:00:15 , aenglish,
19:00:15 , orpat,
19:00:15 , one,
19:00:15 , condom,
19:00:15 , one,
19:00:15 , ling,
19:00:15 , fancy,
19:00:15 , orpat,
19:00:15 , woman,
19:00:19 , watches fo,
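What I need to do from here is open both files. When a query is picked from search.txt, the program should go to log.txt and search for any queries related to it within 60 seconds before and after that query's time. If it finds anything related to the search query, it should store the matches in a list and append that list to the corresponding line of search.txt.
The output should look like this: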
search.txt
19:00:15 , mouse , FALSE - []
19:00:15 , branded luggage bags and trolley , TRUE - []
19:00:15 , Leather shoes for men , FALSE - []
19:00:15 , printers , TRUE - []
19:00:16 , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas,adidas]
19:00:16 , Mobile Charger Stand/Holder black , FALSE - []
19:00:16 , watches for men , TRUE
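Let us take an example:
If "mouse" is the query placed at "19:00:15" in search.txt, the program needs to go to log.txt and look for queries related to "mouse" in the time between "18:59:15" and "19:01:15", that is, 60 seconds before and after the time in search.txt. If there are any related queries, it should store them as a list on that line of search.txt.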
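The window in this example can be checked with datetime arithmetic. The following is only a minimal sketch: the helper name in_window is made up for illustration, and it assumes the times carry no date and never cross midnight.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S"

def in_window(query_time, log_time, seconds=60):
    """Return True if log_time lies within +/- `seconds` of query_time."""
    q = datetime.strptime(query_time, TIME_FORMAT)
    t = datetime.strptime(log_time, TIME_FORMAT)
    # abs() of a timedelta covers both the "before" and the "after" side.
    return abs(t - q) <= timedelta(seconds=seconds)

print(in_window("19:00:15", "18:59:15"))  # True  (exactly 60 s before)
print(in_window("19:00:15", "19:01:15"))  # True  (exactly 60 s after)
print(in_window("19:00:15", "19:01:16"))  # False (61 s after)
```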
The code is as follows:
import datetime
from collections import defaultdict

def getting_partial_queries(querylist):
    # Build every prefix of the query that is at least two characters long.
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist

# Map each logged query to the list of times at which it appears in the log.
queries_time = defaultdict(list)
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries = getting_partial_queries(fields[1].split())
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                # Keep log entries within 60 seconds before or after the query
                # (a single test, so an entry at the same second is added once).
                if abs(ts - timestamp) <= datetime.timedelta(seconds=60):
                    results.append(q)
        outputf.write(line.strip() + " , {}\n".format(results))
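(1) Read the log.txt file, and use the split() method and the collections module to get the counts of all keywords in that file. The target is the second word of each line of the log file.
(2) Read the search.txt file and take the second word of each split line.
(3) Use filter and lambda to search for the keywords in the selected text.
(4) Output:
19:00:15 , mouse , FALSE - []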
19:00:15 , branded luggage bags and trolley , TRUE - []
19:00:15 , Leather shoes for men , FALSE - []
19:00:15 , printers , TRUE - []
19:00:16 , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas]
19:00:16 , Mobile Charger Stand/Holder black , FALSE - []
19:00:16 , watches for men , TRUE - []
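The steps described above (split(), the collections module, filter and lambda) could be sketched roughly as follows. This is not the answerer's actual code: the sample lines are hypothetical stand-ins for the two files, the helper names are invented, and the time window is ignored, so every occurrence of a keyword in the log counts.

```python
from collections import Counter

# Hypothetical sample lines standing in for log.txt and search.txt.
log_lines = [
    "19:00:03 , adidas,",
    "19:00:04 , adidas,",
    "19:00:05 , collar,",
    "19:00:09 , black,",
]
search_lines = [
    "19:00:15 , mouse , FALSE",
    "19:00:16 , adidas watches for men , TRUE",
]

def second_field(line):
    """Return the second comma-separated field of a line, stripped."""
    return line.split(",")[1].strip()

def annotate(search_lines, log_lines):
    # Count how often each keyword occurs in the log.
    keyword_counts = Counter(second_field(line) for line in log_lines)
    annotated = []
    for line in search_lines:
        words = second_field(line).split()
        # Keep only the query words that also occur in the log.
        matched = filter(lambda word: word in keyword_counts, words)
        # Repeat each matched word once per occurrence in the log.
        results = [word for word in matched for _ in range(keyword_counts[word])]
        annotated.append("{} - [{}]".format(line.strip(), ",".join(results)))
    return annotated

for row in annotate(search_lines, log_lines):
    print(row)
# 19:00:15 , mouse , FALSE - []
# 19:00:16 , adidas watches for men , TRUE - [adidas,adidas]
```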
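Note:
Try it yourself first. It is still not clear what you mean by a "partial query", but the code below will do the job as long as you redefine what a partial query is in the function filter_out_common_queries. For example, if you are looking for exact matches of the query from search.txt, you can replace # add your logic here with return [' '.join(querylist),]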
import datetime as dt
from collections import defaultdict

def filter_out_common_queries(querylist):
    # add your logic here
    return querylist

queries_time = defaultdict(list)  # personally, I'd use 'set' as the default factory
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        # "adidas watches for men" -> "adidas" "watches" "for" "men".
        # "for" is a very generic keyword. You would do well to filter these out.
        queries = filter_out_common_queries(fields[1].split())
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                if timestamp - dt.timedelta(seconds=15) <= ts <= timestamp:
                    results.append(q)
        outputf.write(line.strip() + " - {}\n".format(results))
Remark: a match for "black" is found in "Mobile Charger Stand/Holder black". This is because the code above looks up each individual word on its own.
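One possible body for filter_out_common_queries is a simple stop-word filter that drops generic words such as "for". The sketch below is only an illustration: the COMMON_WORDS set is a hypothetical list, not something given in the question, and a word like "black" would still match unless you add it to the set.

```python
# Hypothetical set of generic words; adjust it to your own data.
COMMON_WORDS = {"for", "and", "or", "the", "of"}

def filter_out_common_queries(querylist):
    """Drop generic words so that only the distinctive keywords are looked up."""
    return [word for word in querylist if word.lower() not in COMMON_WORDS]

print(filter_out_common_queries("adidas watches for men".split()))
# ['adidas', 'watches', 'men']
```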
Edit: to implement what you describe in your comment, you need to redefine filter_out_common_queries as follows:
def filter_out_common_queries(querylist):
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist
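To see what this prefix generation produces, here is a small standalone sketch. The name partial_queries and the min_length parameter are made up for the illustration; the logic is the same slicing as above.

```python
def partial_queries(query, min_length=2):
    """Return every prefix of `query` that has at least `min_length` characters."""
    return [query[:n] for n in range(min_length, len(query) + 1)]

print(partial_queries("adidas"))
# ['ad', 'adi', 'adid', 'adida', 'adidas']

# A partially typed log entry such as "watches fo" is one of the prefixes.
prefixes = set(partial_queries("watches for men"))
print("watches fo" in prefixes)  # True
```

Keeping the prefixes in a set makes each membership test cheap when many log entries have to be compared against the same search query.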
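I think you forgot to ask a question.
@BurhanKhalid I have asked it. I need to get the output described above by using these two inputs.
StackOverflow is a site where you post questions, not a list for asking other people to do your work. So have you tried to solve this yourself and then run into a problem? What error did you get? Can you show some code?
Actually I just need some guidance to get past it.
Get past what? What specific problem do you need help with?
Not like that. If the query at 19:00:16 is "watches for men", TRUE, I need to find the partial queries of "watches for men" between 19:00:01 and 19:00:16 in the "log.txt" file. If we get any partial queries in that time span, we put them in the list.
I need it to check partial queries like this: for "adidas watches for men" it can be "ad", "adi", "adid", and so on up to "adidas watches for men". It should always start with the first two letters of the query.
@s_m That is obvious: the code above compares every partial match. It can be improved by checking a longer string sequence only if the previous sequence matched. Example: if "adi" matches, then "adid" should also be checked, otherwise it should not. But this is not a code-writing service. You should try to implement it yourself and ask for help if it does not work.
The queries in the search file come only 1 second apart. My idea is this: for each search query, we look in the log file at the 15 seconds before that query. For the next search query, we again look in the log file at the 15 seconds before it. That is why it takes more time. But I think that when we get to the second query, we do not need to find the previous 15 seconds of log entries all over again. We can just slide the window over the log file down by 1 second, so it will not take more time. Do you understand what I am trying to explain? Will it work?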
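The idea in the last comment relies on the log being ordered by time. A closely related sketch, using binary search from the bisect module instead of a literal sliding window, parses the log once and then finds each 15-second window without rescanning the file. The helper names and the sample lines are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

def parse_time(text):
    return datetime.strptime(text.strip(), "%H:%M:%S")

def build_index(log_lines):
    """Parse the log once into parallel lists sorted by time."""
    entries = []
    for line in log_lines:
        fields = [x.strip() for x in line.split(",")]
        entries.append((parse_time(fields[0]), fields[1]))
    entries.sort(key=lambda entry: entry[0])
    times = [t for t, _ in entries]
    keywords = [k for _, k in entries]
    return times, keywords

def keywords_in_window(times, keywords, query_time, seconds=15):
    """Return the keywords logged in the `seconds` before query_time (inclusive)."""
    lo = bisect_left(times, query_time - timedelta(seconds=seconds))
    hi = bisect_right(times, query_time)
    return keywords[lo:hi]

# Hypothetical sample lines standing in for log.txt.
log_lines = [
    "19:00:00 , door,",
    "19:00:03 , adidas,",
    "19:00:09 , black,",
    "19:00:19 , watches fo,",
]
times, keywords = build_index(log_lines)
print(keywords_in_window(times, keywords, parse_time("19:00:16")))
# ['adidas', 'black']
```

Each lookup costs two binary searches plus the size of the window, so consecutive search queries no longer trigger a pass over the whole log.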