Read 3000 files starting from a specific value and concatenate them into one DataFrame (Python)

python parsing pandas processing ipython


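I have 3000 .dat files that I am reading and concatenating into one DataFrame. They all have the same format (4 columns, no header), except that some of them have a description at the beginning of the file and others do not. To concatenate the files, I need to get rid of those leading lines first.

The skiprows option of pandas.read_csv() does not help here, because the number of lines to skip is very inconsistent from one file to the next. (By the way, I use pandas.read_csv() rather than pandas.read_table() because the files are comma-separated.)

However, the first value after the lines I want to omit is identical in all 3000 files. That value is "2004", the first data point of my dataset.

Is there something similar to skiprows that lets me say "start reading the file at '2004' and skip everything before it" for each of the 3000 files?

I am really out of luck right now and would appreciate some help.

Thanks!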

You could loop over the lines of each file and skip everything until you reach a line that starts with 2004.

Something like:

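import io
import itertools
import pandas as pd

with open(filename) as f:  # filename: path to one .dat file
    # drop leading lines until the first one that starts with "2004"
    data = itertools.dropwhile(lambda line: not line.startswith("2004"), f)
    df = pd.read_csv(io.StringIO("".join(data)), header=None)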

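It is probably not worth trying to be clever here. If you have a convenient criterion, you might as well use it to work out what skiprows should be, for example: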

import pandas as pd
import csv

def find_skip(filename):
    with open(filename, newline="") as fp:
        # (use open(filename, "rb") in Python 2)
        reader = csv.reader(fp)
        for i, row in enumerate(reader):
            # "row and" guards against blank lines, which csv.reader yields as []
            if row and row[0] == "2004":
                return i

for filename in filenames:
    skiprows = find_skip(filename)
    if skiprows is None:
        raise ValueError("something went wrong in determining skiprows!")
    this_df = pd.read_csv(filename, skiprows=skiprows, header=None)
    # do something here, e.g. append this_df to a list and concatenate it after the loop
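
The last comment in the loop above (collect each frame in a list, then concatenate once after the loop) can be sketched as follows. This is only an illustration: the helper name read_all, the marker parameter, and the two sample files are made up for the demonstration. It also assumes each row of the file sits on a single physical line, so that the row index from csv.reader matches the line count that skiprows expects.

```python
import csv
import os
import tempfile

import pandas as pd


def find_skip(filename, marker="2004"):
    """Return the index of the first row whose first field equals marker, or None."""
    with open(filename, newline="") as fp:
        for i, row in enumerate(csv.reader(fp)):
            if row and row[0] == marker:
                return i
    return None


def read_all(filenames, marker="2004"):
    """Read every file from its marker row onward and stack the results."""
    frames = []
    for filename in filenames:
        skiprows = find_skip(filename, marker)
        if skiprows is None:
            raise ValueError("no row starting with %r in %s" % (marker, filename))
        frames.append(pd.read_csv(filename, skiprows=skiprows, header=None))
    # One concat at the end avoids repeatedly copying a growing DataFrame.
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up files (hypothetical contents).
tmpdir = tempfile.mkdtemp()
samples = {
    "a.dat": "some description\nmore notes\n2004,1,2,3\n2005,4,5,6\n",
    "b.dat": "2004,7,8,9\n2005,10,11,12\n",
}
paths = []
for name, text in sorted(samples.items()):
    path = os.path.join(tmpdir, name)
    with open(path, "w") as fp:
        fp.write(text)
    paths.append(path)

combined = read_all(paths)
print(combined.shape)  # (4, 4)
```

For the real data set, the list of paths would come from wherever the 3000 .dat files live, for instance from glob.glob with a suitable pattern.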

Use a skip_to() function:

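import pandas as pd

def skip_to(f, text):
    # Advance f to the start of the first line that begins with text.
    # Returns True if such a line was found, False at end of file.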
    while True:
        last_pos = f.tell()
        line = f.readline()
        if not line:
            return False
        if line.startswith(text):
            f.seek(last_pos)
            return True


with open("tmp.txt") as f:
    if skip_to(f, "2004"):
        df = pd.read_csv(f, header=None)
        print(df)

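If you are sure that the part you want to skip does not contain the string 2004, you can read the file as a string, split it at that point, and then split the rest on newlines. Alternatively, you can loop over the lines and ignore the ones that do not start with 2004.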
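
A minimal sketch of the read-as-a-string idea, assuming (as stated above) that no line of the description itself begins with 2004. The function name read_from_marker and the sample text are hypothetical. Instead of splitting the remainder on newlines by hand, this version passes it to pandas.read_csv() through io.StringIO.

```python
import io

import pandas as pd


def read_from_marker(text, marker="2004"):
    """Parse CSV text starting at the first line that begins with marker."""
    if text.startswith(marker):
        start = 0
    else:
        # Look for the marker at the start of a later line.
        pos = text.find("\n" + marker)
        if pos == -1:
            raise ValueError("marker %r not found at the start of any line" % marker)
        start = pos + 1  # skip the newline itself
    return pd.read_csv(io.StringIO(text[start:]), header=None)


# Made-up file contents: two description lines followed by the data.
sample = "file description\nrecorded by someone\n2004,1,2,3\n2005,4,5,6\n"
df = read_from_marker(sample)
print(df)
```

Searching for a newline followed by the marker, rather than for the bare marker, keeps a 2004 that appears in the middle of a description line from being mistaken for the start of the data.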