Python: How to read multiple .txt files with missing headers and unwanted columns

python pandas csv

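I am trying to read about 2,000 .txt files whose columns are not all the same. I want to select only the headers common to all files and save the result to a CSV file so it can be uploaded to a MySQL database. I need help parsing these files so that only the columns I need are kept. I only need the following columns: code, startDate, startTime, endDate, endTime, s, number. The time columns that follow startDate and endDate have no header in the files; I simply named them "startTime" and "endTime".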

Examples

File 1 sample:


code                         startDate        endDate          s   number
-------------------------------------- ------------------- ------------------- - ----------
4000                                   23-04-2010 00:00:00 23-04-2010 00:14:59 E          1
4001                                   23-04-2010 00:00:00 23-04-2010 00:14:59 E          0
4002                                   23-04-2010 00:00:00 23-04-2010 00:14:59 E          0
4003                                   23-04-2010 00:00:00 23-04-2010 00:14:59 E         0

File 2 sample:

code                         lineNum                         startDate        endDate          s   number id description
-------------------------------------- -------------------------------------- ------------------- ------------------- - ---------- ------------------ ----------------------------------------------------------------------------------------------------
3000                                   2111201                                31-10-2010 05:45:00 31-10-2010 05:59:59 E          9                311 CAPITAL
3000                                   2111201                                31-10-2010 05:45:00 31-10-2010 05:59:59 E          4               1411 USUARIO FRECUENTE
3000                                   2111201                                31-10-2010 05:45:00 31-10-2010 05:59:59 E          1               7071 FUNCIONARIO
3000

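My code so far:

import re
import pandas as pd

file_list = [file1, file2, ...]
datalist = []
for file in file_list:
    with open(file, 'r') as f:
        reader = f.readlines()
        for line in reader:
            # use a regex to keep only lines that contain text or numbers
            if re.search('[0-9a-zA-Z]', line):
                datalist.append(line.strip().split())
header = datalist[0]
try:
    repeatingHeaderIndx = datalist[1:].index(header) + 1
    # remove the repeating header from the data using the index
    datalist.pop(repeatingHeaderIndx)
except:
    pass
df = pd.DataFrame(datalist[1:])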

When I inspect the full DataFrame, it has more columns than I need, because the number of columns can differ from file to file.

You can modify the regular expression so that it only matches lines containing any of the column names:

import re

obj = re.compile(r'\b(code|startDate|startTime|endDate|endTime|s|number)\b')
datalist = []
with open('words.txt', 'r') as reader:
    for line in reader:
        match = obj.findall(line)
        datalist.append(match)
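To see what this pattern actually collects, here is a small check on one header line and one data row. The two sample strings are shortened copies of the File 1 sample (whitespace collapsed), so they are illustrative assumptions rather than real file contents. `findall` returns the matched column names, so a data row with none of those names yields an empty list.

```python
import re

obj = re.compile(r'\b(code|startDate|startTime|endDate|endTime|s|number)\b')

# Shortened, illustrative copies of a header line and a data row.
header_line = "code startDate endDate s number"
data_line = "4000 23-04-2010 00:00:00 23-04-2010 00:14:59 E 1"

# On the header, every wanted name that appears is returned in order.
print(obj.findall(header_line))

# On a data row, none of the names appear, so the list is empty.
print(obj.findall(data_line))
```

An empty list is falsy, so checking `if match:` before appending keeps these rows out of `datalist`.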

So your code would be:

import re
import pandas as pd

file_list = [file1, file2,...]
obj = re.compile(r'\b(code|startDate|startTime|endDate|endTime|s|number)\b')

datalist = []
for file in file_list:
    with open(file, 'r') as f:
        reader = f.readlines()
        for line in reader:
            match = obj.findall(line)
            if match:
                datalist.append(match)
header = datalist[0]
try:
    repeatingHeaderIndx = datalist[1:].index(header) + 1
    # remove the repeating header from the data using its index
    datalist.pop(repeatingHeaderIndx)
except ValueError:
    # list.index raises ValueError when the header does not repeat
    pass
df = pd.DataFrame(datalist[1:])

It's hard to tell from your file samples, but could these be fixed-width text files?

Yes sir, that's possible.

Yabhishek, thanks for your solution. However, "startTime" and "endTime" do not exist as headers in the files. In each row there are two time columns, one after startDate and one after endDate, that have no header and that I need. So I just added them to the list as part of the final result I need.
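If the files are indeed fixed-width, as the comments suggest, one possible approach is to read the column boundaries from the dashed ruler line and slice each row by those positions. The following is a minimal sketch under several assumptions:

- every file has a header line directly above a dashed ruler;
- header names contain no spaces and appear in the same order as the ruler's dash runs;
- the date and time share one ruler column, as in the samples.

The names `WANTED`, `parse_report`, and `combine` are hypothetical helpers, not part of the original thread.

```python
import re
from pathlib import Path

import pandas as pd

WANTED = ["code", "startDate", "startTime", "endDate", "endTime", "s", "number"]


def parse_report(text):
    """Parse one report whose columns are delimited by a dashed ruler line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The ruler is the first line made up only of dashes and spaces.
    ruler_idx = next(i for i, ln in enumerate(lines) if set(ln) <= {"-", " "})
    # Each run of dashes gives the (start, end) positions of one column.
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_idx])]
    # Header names are split on whitespace (assumed to be in ruler order).
    names = lines[ruler_idx - 1].split()
    # Skip repeated headers and rulers, then slice every row by position.
    body = [ln for ln in lines[ruler_idx + 1:]
            if ln.split() != names and not set(ln) <= {"-", " "}]
    rows = [[ln[a:b].strip() for a, b in spans] for ln in body]
    df = pd.DataFrame(rows, columns=names)
    # Split "DD-MM-YYYY HH:MM:SS" into separate date and time columns.
    for date_col, time_col in (("startDate", "startTime"), ("endDate", "endTime")):
        parts = df[date_col].str.extract(r"^(\S+)\s+(\S+)$")
        df[date_col], df[time_col] = parts[0], parts[1]
    # Keep only the wanted columns; drop truncated rows with missing fields.
    return df[WANTED].replace("", pd.NA).dropna().reset_index(drop=True)


def combine(paths, out_csv):
    """Parse every file in `paths` and write the combined rows to one CSV."""
    frames = [parse_report(Path(p).read_text()) for p in paths]
    pd.concat(frames, ignore_index=True).to_csv(out_csv, index=False)
```

Because the extra columns in some files (for example lineNum, id, description) are parsed by position and then discarded, each file ends up with the same seven columns regardless of its original layout. A truncated row such as the lone `3000` at the end of the File 2 sample is dropped because its other fields are empty.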