Python: read a file from the internet and split it into 2


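I'm new to Python and am trying the following: I read a file from the internet and want to split it at a given number of lines. File 1 = lines 1 to x; file 2 = line x+1 to EOF.

I use httplib2 to read the file from the internet and then want to split it in two. I tried doing it with a with statement, but it seems I can't use f.readline() and the like when the file comes from the internet. If I open a local file, it works fine.

Am I missing something?

Thanks a lot in advance for your help.

with data_file as f: (data_file is the file read from the internet)

Here is my function: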

def create_data_files(data_file):

    # read the file from the internet and split it into two files

    # Loading file give info if the file was loaded from cache or internet
    try:
        print("Reading file from the Internet or Cache")
        h = httplib2.Http(".cache")
        data_header, data_file = h.request(DATA_URL) # , headers={'cache-control':'no-cache'}) # to force download from the internet
        data_file = data_file.decode()


    except httplib2.HttpLib2Error as e:
        print(e)

    # Give the info if the file was read from the internet or from the cache

    print("DataHeader", data_header.fromcache)

    if data_header.fromcache == True:
        print("File was read from cache")
    else:
        print("File was read from the internet")

    # Counting the amount of total characters in the file - only for testing
    # print("Total amount of characters in the original file", len(data_file)) # just for testing

    # Counting the lines in the file
    print("Counting lines in the file")
    single_line = data_file.split("\n")
    for value in single_line:
        value =value.strip()
        #print(value)   # just for testing - prints all the lines separated
    print("Total amount of lines in the original file", len(single_line))

    # Asking the user how many lines in percentage of the total amount should be training data
    while True:
        #split_factor = int(input("What percentage should be use as training data? Enter a number between 0 and 100: "))
        split_factor = 70
        print("Split Factor set to 70% for test purposes")
        if 0 <= split_factor <= 100:
            break
        print('try again')

    split_number = int(len(single_line)*split_factor/100)
    print("Number of Training set data", split_number) # just for testing

    # Splitting the file into 2

    training_data_file = 0
    test_data_file = 0




    return training_data_file, test_data_file

This should do the trick (untested and simplified).

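data_header, data_file = h.request(DATA_URL)

Here data_file is not a file-like object but a string (bytes on Python 3 until it is decoded), so it cannot be used in a with statement and has no readline() method.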
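If file-style access is what you want, one option is to wrap the decoded string in io.StringIO, which provides readline() and works in a with statement. The following is a minimal sketch, assuming the response body has already been decoded to str. The sample text and the helper name split_with_readline are made up for illustration and are not part of the original code.

```python
import io


def split_with_readline(text, first_count):
    """Return (first_part, rest): the first `first_count` lines and the remainder."""
    # io.StringIO gives the in-memory string a file-like interface.
    with io.StringIO(text) as f:
        # readline() returns "" once the end is reached, so asking for
        # more lines than exist is harmless.
        first_lines = [f.readline() for _ in range(first_count)]
        rest = f.read()  # everything after the first `first_count` lines
    return "".join(first_lines), rest


sample = "a\nb\nc\nd\n"  # hypothetical stand-in for the decoded response body
head, tail = split_with_readline(sample, 3)
print(repr(head))
print(repr(tail))
```

For a small download, splitting the string directly is usually simpler. The StringIO route is mainly useful when existing code already expects a file object.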

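Thanks, but I get a TypeError: Type str doesn't support the buffer API at lines = deque(content.split('\n')).

Can you paste the exception message you get?

Sure, here it is: lines = deque(content.split('\n')) TypeError: Type str doesn't support the buffer API

I'm using Python 2.7. Are you using Python 3?

Oh, sorry. Yes, I'm using Python 3. I learned the hard way today that there are differences...

Possible duplicate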
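The error in the comments comes from calling split with a str separator on a bytes object: on Python 3, httplib2 returns the response body as bytes. Below is a small sketch of the two ways around it, using a literal bytes value as a stand-in for the downloaded content and assuming the data is UTF-8. The exact wording of the TypeError depends on the Python version.

```python
content = b"line 1\nline 2\nline 3"  # stand-in for the bytes body of a response

# Option 1: decode first (UTF-8 is an assumption here), then split on a str.
text_lines = content.decode("utf-8").split("\n")

# Option 2: stay in bytes and split on a bytes separator.
byte_lines = content.split(b"\n")

# Mixing bytes data with a str separator is what fails on Python 3.
try:
    content.split("\n")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(text_lines)
print(byte_lines)
print(mixed_ok)
```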

from collections import deque
import httplib2


def create_data_files(data_url, split_factor=0.7):

    h = httplib2.Http()
    resp_headers, content = h.request(data_url, "GET")
    # on Python 3 the body is bytes; decode it to str before splitting
    content = content.decode()

    lines = deque(content.split('\n'))

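    # number of lines that go into the training part
    stop = int(len(lines) * split_factor)
    training, test = [], []
    i = 0
    while lines:
        line = lines.popleft()
        # lines 0 .. stop-1 are training data, the rest is test data
        if i < stop:
            training.append(line)
        else:
            test.append(line)
        i += 1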

    training_str, test_str = '\n'.join(training), '\n'.join(test)
    return training_str, test_str
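
A similar split can also be written with list slicing, which needs neither the counter nor the deque. This is only a sketch, assuming the content has already been decoded to str. split_lines is a hypothetical helper name, and split_factor is taken to be a fraction between 0 and 1, as in the function above.

```python
def split_lines(text, split_factor=0.7):
    """Split `text` into two strings at int(number_of_lines * split_factor)."""
    lines = text.splitlines()
    stop = int(len(lines) * split_factor)
    # The first `stop` lines are training data, the remainder is test data.
    training, test = lines[:stop], lines[stop:]
    return "\n".join(training), "\n".join(test)


sample = "\n".join(str(n) for n in range(10))  # ten lines: "0" .. "9"
training_str, test_str = split_lines(sample)
print(len(training_str.splitlines()), "training lines")
print(len(test_str.splitlines()), "test lines")
```

Each returned string can then be written out with open(path, "w") if two files on disk are needed.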