Python: extracting text from a txt file and converting it to a DataFrame

Given this txt file with values:
google.com('172.217.163.46', 443)
commonName: *.google.com
issuer: GTS CA 1O1
notBefore: 2020-02-12 11:47:11
notAfter: 2020-05-06 11:47:11
facebook.com('31.13.79.35', 443)
commonName: *.facebook.com
issuer: DigiCert SHA2 High Assurance Server CA
notBefore: 2020-01-16 00:00:00
notAfter: 2020-04-15 12:00:00
How can I convert this into a DataFrame?
I tried the following, with partial success:
import numpy as np
import pandas as pd
from io import StringIO

with open("out.txt") as f:
    data = f.read()

a = (pd.read_csv(StringIO(data),
                 header=None,
                 # use a delimiter not present in the text file
                 # to force pandas to read the data into one column
                 sep="/",
                 names=['string'])
     # limit the number of splits to 1
     .string.str.split(':', n=1, expand=True)
     .rename({0: 'Name', 1: 'temp'}, axis=1)
     .assign(temp=lambda x: np.where(x.Name.str.strip()
                                     # look for strings that end
                                     # with a bracket
                                     .str.match(r'(.*[)]$)'),
                                     x.Name,
                                     x.temp),
             Name=lambda x: x.Name.str.replace(r'(.*[)]$)', 'Name',
                                               regex=True))
     # remove whitespace
     .assign(Name=lambda x: x.Name.str.strip())
     .pivot(columns='Name', values='temp')
     .ffill()
     .dropna(how='any')
     .reset_index(drop=True)
     .rename_axis(None, axis=1)
     .filter(['Name', 'commonName', 'issuer', 'notBefore', 'notAfter'])
     )
But this loops and gives me multiple rows of the same duplicated data.

Try this:
import pandas as pd

# ==============
# read the text file
# ==============
file = open('in.txt')
lines = file.readlines()

# ==============
# build a dict
# ==============
mydict = {}
for i in range(0, len(lines), 6):
    # ==============
    # add "Name" to the dict
    # ==============
    if 'Name' not in mydict:
        mydict['Name'] = []
    mydict['Name'].append(lines[i].strip('\n'))

    # ==============
    # add the other columns to the dict
    # ==============
    for line in lines[i + 1:i + 5]:
        key, *value = line.strip().strip('\n').split(':', maxsplit=1)
        if key not in mydict:
            mydict[key] = []
        mydict[key].append(''.join(value).strip())

pd.DataFrame(mydict)
Output:
+----+-----------------------------------+----------------+----------------------------------------+---------------------+---------------------+
|    | Name                              | commonName     | issuer                                 | notBefore           | notAfter            |
|----+-----------------------------------+----------------+----------------------------------------+---------------------+---------------------|
|  0 | google.com('172.217.163.46', 443) | *.google.com   | GTS CA 1O1                             | 2020-02-12 11:47:11 | 2020-05-06 11:47:11 |
|  1 | facebook.com('31.13.79.35', 443)  | *.facebook.com | DigiCert SHA2 High Assurance Server CA | 2020-01-16 00:00:00 | 2020-04-15 12:00:00 |
+----+-----------------------------------+----------------+----------------------------------------+---------------------+---------------------+
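The dict-of-lists pattern used above maps directly onto a DataFrame: each key becomes a column name and each list holds that column's values in row order. A minimal sketch, using two of the columns from the sample file:

```python
import pandas as pd

# each key becomes a column; each list holds that column's values in row order
mydict = {
    'Name': ["google.com('172.217.163.46', 443)",
             "facebook.com('31.13.79.35', 443)"],
    'commonName': ['*.google.com', '*.facebook.com'],
}
df = pd.DataFrame(mydict)
print(df.shape)  # (2, 2)
```

This is why the loop appends an empty list the first time a key is seen: all lists must end up the same length, one entry per record.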
The file is not in csv format, so you should not read it with read_csv;
instead, parse it manually. Here you could do:
import pandas as pd

with open("out.txt") as fd:
    cols = {'commonName', 'issuer', 'notBefore', 'notAfter'}  # columns to keep
    rows = []                                                 # list of records
    for line in fd:
        line = line.strip()
        if ':' in line:
            elt = line.split(':', 1)         # data line: parse it
            if elt[0] in cols:
                rec[elt[0]] = elt[1].strip()
        elif len(line) > 0:
            rec = {'Name': line}             # initial line of a block
            rows.append(rec)

a = pd.DataFrame(rows)   # and build the dataframe from the list of records
It gives:
Name commonName issuer notAfter notBefore
0 google.com('172.217.163.46', 443) *.google.com GTS CA 1O1 2020-05-06 11:47:11 2020-02-12 11:47:11
1 facebook.com('31.13.79.35', 443) *.facebook.com DigiCert SHA2 High Assurance Server CA 2020-04-15 12:00:00 2020-01-16 00:00:00
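As a further option, not taken from either answer above: since every block has the same fixed shape, a single regular expression with named groups can also pull the records out in one pass. A sketch, with the question's file content inlined as a string for illustration:

```python
import re
import pandas as pd

text = """google.com('172.217.163.46', 443)
commonName: *.google.com
issuer: GTS CA 1O1
notBefore: 2020-02-12 11:47:11
notAfter: 2020-05-06 11:47:11
facebook.com('31.13.79.35', 443)
commonName: *.facebook.com
issuer: DigiCert SHA2 High Assurance Server CA
notBefore: 2020-01-16 00:00:00
notAfter: 2020-04-15 12:00:00
"""

# one named group per field; the Name line is the only one without a colon
pattern = re.compile(
    r"(?P<Name>^[^:\n]+\)$)\n"
    r"commonName: (?P<commonName>.+)\n"
    r"issuer: (?P<issuer>.+)\n"
    r"notBefore: (?P<notBefore>.+)\n"
    r"notAfter: (?P<notAfter>.+)",
    re.MULTILINE,
)

df = pd.DataFrame(m.groupdict() for m in pattern.finditer(text))
```

The trade-off is that this breaks if a field is missing or reordered, whereas the line-by-line parser above only relies on each block starting with a colon-free name line.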