Comparing two text files and finding related words in Python


I have two text files named search.txt and log.txt, which contain data like the following:

search.txt

19:00:15  , mouse , FALSE
19:00:15  , branded luggage bags and trolley , TRUE
19:00:15  , Leather shoes for men , FALSE
19:00:15  , printers , TRUE
19:00:16  , adidas watches for men , TRUE
19:00:16  , Mobile Charger Stand/Holder black , FALSE
19:00:16  , watches for men , TRUE

log.txt

19:00:00 ,  trakjkfsa,
19:00:00 ,  door,
19:00:00 ,  sweater,
19:00:00 ,  sweater,
19:00:00 ,  sweater,
19:00:00 ,  dis,
19:00:01 ,  not,
19:00:01 ,  nokia,
19:00:01 ,  collar,
19:00:01 ,  nokia,
19:00:01 ,  collar,
19:00:01 ,  gsm,
19:00:01 ,  sweater,
19:00:01 ,  sweater,
19:00:01 ,  gsm,
19:00:02 ,  gsm,
19:00:02 ,  show,
19:00:02 ,  wayfreyerv,
19:00:02 ,  door,
19:00:02 ,  collar,
19:00:02 ,  or,
19:00:02 ,  harman,
19:00:02 ,  women's,
19:00:02 ,  collar,
19:00:02 ,  sweater,
19:00:02 ,  head,
19:00:03 ,  womanw,
19:00:03 ,  com.shopclues.utils.k@42233ff0,
19:00:03 ,  samsu,
19:00:03 ,  adidas,
19:00:03 ,  collar,
19:00:04 ,  ambas,
19:00:04 ,  harman,
19:00:04 ,  mi,
19:00:04 ,  nor,
19:00:04 ,  airtel,
19:00:04 ,  ,
19:00:04 ,  adidas,
19:00:05 ,  harman,
19:00:05 ,  collar,
19:00:05 ,  flip,
19:00:05 ,  brass,
19:00:05 ,  laptop,
19:00:05 ,  collar,
19:00:05 ,  wayfreyer,
19:00:05 ,  head,
19:00:05 ,  adidas,
19:00:05 ,  discn,
19:00:05 ,  head,
19:00:05 ,  adidas,
19:00:05 ,  collar,
19:00:05 ,  collar,
19:00:06 ,  disco,
19:00:06 ,  head,
19:00:06 ,  harman,
19:00:06 ,  nigh,
19:00:06 ,  microsoft,
19:00:06 ,  ambassado,
19:00:07 ,  salwar,
19:00:07 ,  bb,
19:00:07 ,  harman,
19:00:07 ,  ambassador,
19:00:07 ,  ambassador,
19:00:07 ,  salwar,
19:00:08 ,  microsoft,
19:00:08 ,  ac,
19:00:08 ,  jea,
19:00:08 ,  gens, 
19:00:08 ,  ambassador,
19:00:08 ,  orpa,
19:00:09 ,  ac,
19:00:09 ,  black,
19:00:09 ,  asus,
19:00:09 ,  salwar,
19:00:09 ,  salwar,
19:00:09 ,  ac,
19:00:10 ,  whechains,
19:00:10 ,  gens,
19:00:10 ,  ambassador,
19:00:10 ,  sony,
19:00:10 ,  salwa,
19:00:10 ,  ac,
19:00:10 ,  woman,
19:00:10 ,  li,
19:00:11 ,  boxers,
19:00:11 ,  harman,
19:00:11 ,  sal,
19:00:11 ,  ambassador,
19:00:11 ,  sony, 
19:00:11 ,  ,
19:00:11 ,  boxers,
19:00:12 ,  adidas,
19:00:12 ,  samsung,
19:00:12 ,  boxer,
19:00:12 ,  boxers,
19:00:12 ,  com.shopclues.utils.k@427b9538,
19:00:12 ,  harman,
19:00:12 ,  wechains#002,
19:00:12 ,  collar,
19:00:13 ,  collar,
19:00:13 ,  collar,
19:00:13 ,  one,
19:00:13 ,  collar,
19:00:13 ,  ambassador,
19:00:13 ,  hitech,
19:00:13 ,  fanc,
19:00:13 ,  adidas,
19:00:13 ,  bp,
19:00:13 ,  asus,
19:00:13 ,  ambassador,
19:00:13 ,  harman,
19:00:14 ,  lin,
19:00:14 ,  one,
19:00:14 ,  samsung,
19:00:14 ,  cond,
19:00:14 ,  atx,
19:00:15 ,  blackles#002,
19:00:15 ,  woman,
19:00:15 ,  asus,
19:00:15 ,  airtel,
19:00:15 ,  weel,
19:00:15 ,  aenglish,
19:00:15 ,  orpat,
19:00:15 ,  one,
19:00:15 ,  condom,
19:00:15 ,  one,
19:00:15 ,  ling,
19:00:15 ,  fancy,
19:00:15 ,  orpat,
19:00:15 ,  woman,
19:00:19 , watches fo,
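
What I need to do is open both files and, for each query in search.txt, go to log.txt and search for any queries related to it within 60 seconds before and after the query's timestamp. If anything related to the search query is found, the matches should be stored in a list and appended to the corresponding line of search.txt.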

The output should look like this:

search.txt

19:00:15  , mouse , FALSE - []
19:00:15  , branded luggage bags and trolley , TRUE - []
19:00:15  , Leather shoes for men , FALSE - []
19:00:15  , printers , TRUE - []
19:00:16  , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas,adidas]
19:00:16  , Mobile Charger Stand/Holder black , FALSE - []
19:00:16  , watches for men , TRUE
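
Let's take an example: if "mouse" is a query in search.txt logged at "19:00:15", the program needs to go to log.txt and look for queries related to "mouse" between "18:59:15" and "19:01:15", that is, 60 seconds before and after the time in search.txt. If there are any related queries, it stores them as a list on that line of search.txt.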
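The window check described in this example can be written with `datetime.timedelta`. This is only a minimal sketch, assuming every timestamp uses the `HH:MM:SS` format of the sample data and that no window crosses midnight; the helper name `in_window` is made up for illustration.

```python
import datetime as dt

def in_window(query_time, log_time, seconds=60):
    """Return True if log_time lies within +/- `seconds` of query_time.

    Both arguments are "HH:MM:SS" strings. `in_window` is a hypothetical
    helper, not part of the code in the question.
    """
    fmt = "%H:%M:%S"
    q = dt.datetime.strptime(query_time, fmt)
    t = dt.datetime.strptime(log_time, fmt)
    # abs() of a timedelta gives the distance regardless of direction.
    return abs(t - q) <= dt.timedelta(seconds=seconds)

print(in_window("19:00:15", "18:59:15"))  # lower edge of the window
print(in_window("19:00:15", "19:01:16"))  # one second past the upper edge
```

Using a single absolute-difference test also avoids counting a log entry twice when its timestamp equals the query's timestamp.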

The code is as follows:

import datetime
from collections import defaultdict

def getting_partial_queries(querylist):
    # Build every prefix of the query, starting from its first two characters.
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist

# Map each logged query to the timestamps at which it appears in log.txt.
queries_time = defaultdict(list)
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries = getting_partial_queries(fields[1].split())
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                # Keep log entries within 60 seconds before or after the query.
                if abs(ts - timestamp) <= datetime.timedelta(seconds=60):
                    results.append(q)
        outputf.write(line.strip() + " , {}\n".format(results))
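  • Read the log.txt file and use the split() method and the collections module to count every keyword in it. The target is the second field of each line in the log file.
  • Now we have all the keywords and their counts.
  • Read the search.txt file line by line.
  • Get the target words from each line, i.e. the second field after splitting the line.
  • Use filter and lambda to search for the keywords in the selected text (step 4).
  • Get the count from our dictionary and build the new line as required, using string formatting and the join method.
  • Write the new line to the new file.
  • Code: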
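The code that belonged to this answer is not present here. The following is only a rough sketch of the listed steps, not the answerer's original code: the function names `keyword_counts` and `annotate` are invented, the fields are assumed to be comma-separated as in the sample data, and, like the steps themselves, it ignores the time window. It is not guaranteed to reproduce the output listed next.

```python
from collections import Counter

def keyword_counts(log_lines):
    """Count the second comma-separated field of every log line."""
    counts = Counter()
    for line in log_lines:
        fields = [x.strip() for x in line.split(',')]
        if len(fields) > 1 and fields[1]:
            counts[fields[1]] += 1
    return counts

def annotate(search_line, counts):
    """Append the logged keywords that occur as words in the search query."""
    fields = [x.strip() for x in search_line.split(',')]
    words = fields[1].split()
    # filter + lambda keeps only the words that were seen in the log.
    matched = list(filter(lambda w: w in counts, words))
    # Repeat each matched keyword as many times as it was logged.
    hits = [w for w in matched for _ in range(counts[w])]
    return "{} - [{}]".format(search_line.strip(), ",".join(hits))

# Tiny made-up log sample in the same format as log.txt.
log = ["19:00:03 ,  adidas,", "19:00:04 ,  adidas,", "19:00:05 ,  collar,"]
print(annotate("19:00:16  , adidas watches for men , TRUE", keyword_counts(log)))
```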

    Output:

    19:00:15  , mouse , FALSE - []
    19:00:15  , branded luggage bags and trolley , TRUE - []
    19:00:15  , Leather shoes for men , FALSE - []
    19:00:15  , printers , TRUE - []
    19:00:16  , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas]
    19:00:16  , Mobile Charger Stand/Holder black , FALSE - []
    19:00:16  , watches for men , TRUE - []
    
    Note: try it yourself first.

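    Although it is still unclear what you mean by "partial queries", the code below will do the job as long as you redefine what counts as a partial query inside the function filter_out_common_queries. For example, if you are looking for exact matches of the queries in search.txt, you can replace # add your logic here with return [' '.join(querylist),].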

    import datetime as dt
    from collections import defaultdict
    
    def filter_out_common_queries(querylist):
        # add your logic here
        return querylist
    
    queries_time = defaultdict(list)  # personally, I'd use 'set' as the default factory
    with open('log.txt') as f:
        for line in f:
            fields = [ x.strip() for x in line.split(',') ]
            timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
            queries_time[fields[1]].append(timestamp)  
    
    with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
        for line in inputf:
            fields = [ x.strip() for x in line.split(',') ]
            timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
            # "adidas watches for men" -> "adidas" "watches" "for" "men".
            # "for" is a very generic keyword; you would do well to filter these out.
            queries = filter_out_common_queries(fields[1].split())
            results = []
            for q in queries:
                poss_timestamps = queries_time[q]
                for ts in poss_timestamps:
                    if timestamp - dt.timedelta(seconds=15) <= ts <= timestamp:
                        results.append(q)
            outputf.write(line.strip() + " - {}\n".format(results))
    
    Remark: a match for "black" was found in "Mobile Charger Stand/Holder black". This is because the code above looks up each individual word on its own.
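The comment in the code above suggests filtering out generic keywords such as "for". One possible way to fill in the `# add your logic here` placeholder is a small stop-word set. This is a sketch only: the `COMMON_WORDS` list is an assumption made up for illustration and would need tuning to real data.

```python
# Hypothetical stop-word list; adjust it to your own queries.
COMMON_WORDS = {"for", "and", "or", "the", "of", "in"}

def filter_out_common_queries(querylist):
    """Drop generic words so they are not looked up on their own."""
    return [q for q in querylist if q.lower() not in COMMON_WORDS]

print(filter_out_common_queries("adidas watches for men".split()))
```

Note that this does not remove a word like "black", which is specific enough to be a real query on its own; whether such single-word matches are wanted depends on how "related" is defined.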

    Edit: to implement what you describe in your comment, you need to redefine filter_out_common_queries as follows:

    def filter_out_common_queries(querylist):
        basequery = ' '.join(querylist)
        querylist = []
        for n in range(2,len(basequery)+1):
            querylist.append(basequery[:n])
        return querylist
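To make the effect of this prefix-building logic concrete, here is a small standalone sketch of the same idea; the name `partial_queries` is hypothetical and the query is shortened for readability.

```python
def partial_queries(querylist):
    """Return every prefix of the joined query, from 2 characters upward."""
    basequery = ' '.join(querylist)
    return [basequery[:n] for n in range(2, len(basequery) + 1)]

prefixes = partial_queries("adidas watches".split())
print(prefixes[:5])   # the shortest prefixes
print(prefixes[-1])   # the full query
```

Some generated prefixes end in a space (for example "adidas "). Because the log fields are stripped before being stored, such prefixes can never match a log entry; that is harmless here but adds a few wasted lookups.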
    

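I think you forgot to ask a question.

@BurhanKhalid I have already asked. I need to get the output described above from these two inputs.

StackOverflow is a site where you post questions, not a list where you ask other people to do your work. So have you tried to solve this yourself and then run into a problem? What error did you get? Can you show some code?

Actually I just need some guidance to get past it.

Get past what? What specific problem do you need help with?

Not like that. If the query at 19:00:16 is "watches for men", TRUE, I need to find the partial queries of "watches for men" in the log.txt file between 19:00:01 and 19:00:16. If we get any partial queries in that period, we put them in the list.

I need it to check partial queries like this: for "adidas watches for men" they could be "ad", "adi", "adid", ... up to "adidas watches for men". A partial query should always start with the first two letters of the query.

@s_m That's obvious: the code above compares every partial match. It could be improved by checking a longer string only if the previous, shorter sequence matched. Example: if "adi" matches, then "adid" should be checked as well; otherwise it should not. But this is not a code-writing service. You should try to implement it yourself and ask for help if it does not work.

The queries in the search file are only 1 second apart. My idea is that for each search query we look at the 15 seconds of the log file before that query. But for the next search query we again look at the 15 seconds before it, and that is why it takes more time. I think that when we get to the second query we do not need to find the previous 15 seconds of queries all over again; we can just slide the window over the log file forward by 1 second, so it will not take more time. Do you understand what I am trying to explain? Will it work?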
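The sliding-window idea from the last comment can be sketched with a `collections.deque` that holds only the log entries from the last 15 seconds. This is an illustration under stated assumptions, not a tested solution for the real files: both inputs are assumed to be already parsed into `(timestamp, text)` pairs and sorted by time, only the 15 seconds before each query are considered, matching is case-sensitive, and the names `parse` and `windowed_matches` are made up.

```python
import datetime as dt
from collections import deque

def parse(ts):
    """Parse an "HH:MM:SS" string into a datetime."""
    return dt.datetime.strptime(ts, "%H:%M:%S")

def windowed_matches(search_rows, log_rows, seconds=15):
    """Yield (query, hits) while reading the log only once.

    Both inputs are lists of (timestamp_string, text), sorted by time.
    """
    window = deque()                      # log rows inside the current window
    span = dt.timedelta(seconds=seconds)
    i = 0                                 # index of the next unread log row
    for ts, query in search_rows:
        now = parse(ts)
        # Pull in log rows up to the current query time.
        while i < len(log_rows) and parse(log_rows[i][0]) <= now:
            window.append((parse(log_rows[i][0]), log_rows[i][1]))
            i += 1
        # Drop rows that have fallen out of the window.
        while window and window[0][0] < now - span:
            window.popleft()
        # A partial query is a prefix of the query with at least 2 characters.
        hits = [text for _, text in window
                if len(text) >= 2 and query.startswith(text)]
        yield query, hits

# Made-up sample rows in time order.
log = [("19:00:00", "ad"), ("19:00:03", "adidas"),
       ("19:00:05", "collar"), ("19:00:12", "adidas")]
search = [("19:00:16", "adidas watches for men"), ("19:00:20", "collar tshirt")]
for query, hits in windowed_matches(search, log):
    print(query, hits)
```

Each log row is appended and removed at most once, so the total work grows with the size of the two files rather than with the number of queries times the window size.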