How to count the total number of lines in a text file using Python

python file file-io


For example, if my text file is:

blue
green
yellow
black

There are four lines here, and I want to get four as the result. How can I do that?

You can use sum() with a generator expression:

with open('data.txt') as f:
    print(sum(1 for _ in f))

An explicit loop gives the same result:

count = 0
with open('filename.txt', 'rb') as f:
    for line in f:
        count += 1

print(count)

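Note that you cannot use len(f), because f is an iterator. _ is a conventional variable name for throwaway values.

You can use len(f.readlines()), but this creates an additional list in memory and will not even work on huge files that do not fit in memory.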
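To make the point about len() concrete, here is a small sketch. The file name demo.txt and its contents are made up for illustration, and the file is written to a temporary directory so nothing else is touched.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.txt")
with open(path, "w") as f:
    f.write("blue\ngreen\nyellow\nblack\n")

with open(path) as f:
    try:
        len(f)  # file objects are iterators and define no __len__
        len_failed = False
    except TypeError:
        len_failed = True

with open(path) as f:
    streamed = sum(1 for _ in f)       # holds one line in memory at a time

with open(path) as f:
    materialized = len(f.readlines())  # builds the full list of lines first

print(len_failed, streamed, materialized)
```

Both counts agree; the difference is only in how much of the file is held in memory at once.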

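You can use sum() with a generator expression. The generator expression yields a 1 for every line, i.e. 1, 1, ... up to the length of the file. Then sum() adds them together to give the total count: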

with open('text.txt') as myfile:
    count = sum(1 for line in myfile)

Judging by what you have tried, it seems you do not want to include empty lines. In that case you can do the following:

with open('text.txt') as myfile:
    count = sum(1 for line in myfile if line.rstrip('\n'))
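The filter above only drops lines that are completely empty; a line containing nothing but spaces still counts. The sketch below contrasts the three variants. The helper name count_lines, its keyword arguments, and the sample file are assumptions made for illustration.

```python
import os
import tempfile

def count_lines(path, skip_empty=False, skip_whitespace_only=False):
    """Count lines in a text file, optionally skipping blank ones."""
    with open(path) as f:
        if skip_whitespace_only:
            # strip() removes all surrounding whitespace, so "   \n" is skipped
            return sum(1 for line in f if line.strip())
        if skip_empty:
            # rstrip('\n') removes only the newline, so "   \n" still counts
            return sum(1 for line in f if line.rstrip('\n'))
        return sum(1 for _ in f)

# Hypothetical sample: two words, one empty line, one line of spaces.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("blue\n\n   \ngreen\n")

total = count_lines(path)
non_empty = count_lines(path, skip_empty=True)
non_blank = count_lines(path, skip_whitespace_only=True)
print(total, non_empty, non_blank)
```

Which variant is right depends on whether whitespace-only lines should be treated as content.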

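This link has lots of potential solutions, but they all ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering.

Using a modified version of the timing tool, I believe the following code is faster (and marginally more pythonic) than any of the solutions offered: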

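def _make_gen(reader):
    # yield 1 MiB chunks until the reader returns an empty bytes object
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def rawpycount(filename):
    # the with block closes the file once all chunks have been counted
    with open(filename, 'rb') as f:
        f_gen = _make_gen(f.raw.read)
        return sum(buf.count(b'\n') for buf in f_gen)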

Here are my timings (the last column is the ratio relative to the fastest entry):

rawpycount        0.0048  0.0046   1.00
bufcount          0.0074  0.0066   1.43
wccount             0.01    0.01   2.17
itercount          0.014   0.014   3.04
opcount            0.021    0.02   4.43
kylecount          0.023   0.021   4.58
simplecount        0.022   0.022   4.81
mapcount           0.038   0.032   6.82

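I would post it there, but I'm a relatively new user to Stack Exchange and don't have the requisite manna.

Edit:

This can be done completely with generator expressions in-line using itertools, but it gets pretty weird looking: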

from itertools import (takewhile, repeat)

def rawbigcount(filename):
    # the with block closes the file once all chunks have been counted
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen if buf)
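One caveat with counting newline bytes: a file whose last line has no trailing newline comes out one short compared with iterating over lines. The following is a minimal sketch of one way to handle that, not part of the original answer. The name count_lines_raw and the chunk_size parameter are assumptions, and it opens the file with buffering=0 (Python 3) instead of going through f.raw.

```python
import os
import tempfile

def count_lines_raw(filename, chunk_size=1024 * 1024):
    """Count lines from raw chunks, including an unterminated final line."""
    count = 0
    last = b""
    # buffering=0 gives an unbuffered binary file object in Python 3
    with open(filename, "rb", buffering=0) as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            count += buf.count(b"\n")
            last = buf[-1:]          # remember the final byte seen so far
    if last and last != b"\n":
        count += 1                   # last line had no trailing newline
    return count

# Hypothetical sample files, with and without a trailing newline.
tmpdir = tempfile.mkdtemp()
with_nl = os.path.join(tmpdir, "with_newline.txt")
without_nl = os.path.join(tmpdir, "without_newline.txt")
with open(with_nl, "wb") as f:
    f.write(b"a\nb\nc\n")
with open(without_nl, "wb") as f:
    f.write(b"a\nb\nc")

print(count_lines_raw(with_nl), count_lines_raw(without_nl))
```

Whether the extra check is wanted depends on whether an unterminated last line should count as a line for your purposes.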

This one also gives the number of lines in a file:

with open('filename.txt', 'r') as a:
    l = a.read()  # reads the whole file into memory
count = l.splitlines()
print(len(count))

A one-liner:

total_line_count = sum(1 for line in open("filename.txt"))

print(total_line_count)

That will work.

For those who say to use with open("filename.txt", "r") as f, you can also do anyname = open("filename.txt", "r"), although you then have to close the file yourself.

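Here is a way to do it with a list comprehension, but this wastes a little of your computer's memory, since a full list is built and line.strip() is called twice:

with open('textfile.txt') as file:
    lines = [
        line.strip()
        for line in file
        if line.strip() != '']
print("number of lines =  {}".format(len(lines)))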
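If the stripped lines themselves are needed, strip() can be called once per line by feeding the list comprehension from a nested generator expression. This is a sketch under that assumption; the sample file and the name stripped are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample file with an empty line and a whitespace-only line.
path = os.path.join(tempfile.mkdtemp(), "textfile.txt")
with open(path, "w") as f:
    f.write("blue\n\ngreen\n  \nyellow\nblack\n")

with open(path) as file:
    stripped = (line.strip() for line in file)  # lazy: one strip() per line
    lines = [s for s in stripped if s]          # keep only non-blank results

print("number of lines = {}".format(len(lines)))
```

If only the count matters, summing over the generator avoids building the list at all.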

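I'm not new to Stack Overflow, I just never had an account and usually came here for answers. I can't comment or upvote an answer yet. But I wanted to say that the code from Michael Bacon above works really well. I'm new to Python but not to programming. I've been reading Python Crash Course, and there are a few things I wanted to do to break up the cover-to-cover reading approach. One utility that is useful from an ETL or even a data quality perspective is capturing the row count of a file independently of any ETL. The file has X rows, you import it into SQL or Hadoop, and you end up with X rows. You can validate the row count of the raw data file at the lowest level.
I've been playing with his code and doing some testing, and so far this code is very efficient. I've created several CSV files of various sizes and row counts. You can see my code below, and my comments provide the times and details. The code from Michael Bacon above runs about 6 times faster than the normal Python method of just looping over the lines.
Hope this helps someone.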

If you import pandas, you can use the shape attribute to determine this. Not sure how it performs. The code is as follows:

import pandas as pd
data = pd.read_csv("yourfile")   # reads the whole file into a DataFrame
num_records = data.shape          # (rows, columns) tuple
n_records = num_records[0]        # number of data rows (the header row is not counted)

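with open('data.txt') as fp: for line in fp: if line.strip(): count += 1

@alecxe will it work?

Yes, it will work, but the solution is not pythonic; better to use sum().

Already explained well enough ;-)

Possible duplicate.

Thanks! This itertools implementation is blazing fast, and it lets me report a percentage complete while reading a very large file.

I'm getting an error: AttributeError: 'file' object has no attribute 'raw'. Any idea why?

The code here is specific to Python 3, which is where the raw/unicode split happened. My Python 2 memory isn't that good at this point, but if you are using Python 2, I think that if you change the mode in the open() call to 'r' and change "f.raw.read()" to "f.read()", you will effectively get the same thing in Python 2.

Would changing the return statement in the first example to return sum(map(methodcaller("count", b'\n'), f_gen)), importing methodcaller from operator, help speed this up at all (and imap from itertools as well, if on Python 2)? I would also turn the 1024*1024 math into a constant to save a few extra cycles. I would like to see a comparison with the second example as well.

So pythonic, so very pythonic :o

If you write with open('data.txt') as f: print sum([1 for _ in f]), will it be faster?

@jimh - it is better to just use sum(1 for _ in f), because it implicitly uses a generator expression inside the parentheses and does not create a list of 1s. Your version, sum([1 for _ in f]), creates a list of 1s before summing them, which allocates memory unnecessarily.

@blokeley whether it is faster at the expense of memory is my question.

@jimh there is no such trade-off here. The generator expression will do less, because it does not have to spend time allocating memory. If you can reuse the allocated list or dict, then a comprehension can be an optimization.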
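One comment above mentions reporting a percentage complete while reading a very large file. A minimal sketch of how that could look with the chunked approach follows; the function name, the report callback, and the tiny chunk size in the demo are assumptions for illustration, not the commenter's actual code.

```python
import os
import tempfile

def count_lines_with_progress(filename, chunk_size=1024 * 1024, report=None):
    """Count newline bytes chunk by chunk, calling report(percent) per chunk."""
    total = os.path.getsize(filename)
    done = 0
    count = 0
    # buffering=0 gives an unbuffered binary file object in Python 3
    with open(filename, "rb", buffering=0) as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            done += len(buf)
            count += buf.count(b"\n")
            if report is not None and total:
                report(100.0 * done / total)
    return count

# Hypothetical demo file: 1000 short rows, read in deliberately small chunks.
path = os.path.join(tempfile.mkdtemp(), "rows.csv")
with open(path, "wb") as f:
    f.write(b"row\n" * 1000)

percents = []
count = count_lines_with_progress(path, chunk_size=1000, report=percents.append)
print(count, percents[-1])
```

In real use the callback would update a progress bar or log line rather than append to a list.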

import time
from itertools import (takewhile, repeat)

def readfilesimple(myfile):

    # count lines by iterating over the file in text mode
    linecounter = 0
    with open(myfile, 'r') as file_object:
        for lines in file_object:
            linecounter += 1

    return linecounter

def readfileadvanced(myfile):

    # count newline bytes in 1 MiB chunks read through the raw (unbuffered) interface
    with open(myfile, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen if buf)


# ************************************
# Main
# ************************************

#start the clock

start_time = time.time()

# 6.7 seconds to read a 475MB file that has 24 million rows and 3 columns
#mycount = readfilesimple("c:/junk/book1.csv")

# 0.67 seconds to read a 475MB file that has 24 million rows and 3 columns
#mycount = readfileadvanced("c:/junk/book1.csv")

# 25.9 seconds to read a 3.9Gb file that has 3.25 million rows and 104 columns
#mycount = readfilesimple("c:/junk/WideCsvExample/ReallyWideReallyBig1.csv")

# 5.7 seconds to read a 3.9Gb file that has 3.25 million rows and 104 columns
#mycount = readfileadvanced("c:/junk/WideCsvExample/ReallyWideReallyBig1.csv")


# 292.92 seconds to read a 43Gb file that has 35.7 million rows and 104 columns
mycount = readfilesimple("c:/junk/WideCsvExample/ReallyWideReallyBig.csv")

# 57 seconds to read a 43Gb file that has 35.7 million rows and 104 columns
#mycount = readfileadvanced("c:/junk/WideCsvExample/ReallyWideReallyBig.csv")


#stop the clock
elapsed_time = time.time() - start_time


print("\nCode Execution: " + str(elapsed_time) + " seconds\n")
print("File contains: " + str(mycount) + " lines of text.")
