Python: crawling a list of URLs from a CSV with requests.get

I am trying to crawl a list of URLs contained in a CSV file. The URLs are in the 6th column of the CSV. The URLs have the format:

The code below does not read the data from the CSV correctly. Where did I make a coding mistake?

list_of_urls = open(filename).read()

for i in range(6,len(list_of_urls)):

    try:
        url=str(list_of_urls[i][0])
        #crawl urls
        secondCrawlRequest = requests.get(url, headers=http_headers, timeout=5)

        raw_html = secondCrawlRequest.text
    except requests.ConnectionError as e:
        logging.exception(e)
    except requests.HTTPError as e:
        logging.exception(e)
    except requests.Timeout as e:
        logging.exception(e)
    except requests.RequestException as e:
        logging.exception(e)
        sys.exit(1)
You should use a csv.reader.

If you need to skip the header row, you can do so by calling
next(reader)
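A minimal sketch of that approach, assuming the URLs sit in the 6th column (index 5); the file name urls.csv and its contents are made up for illustration, while the real script would use the question's filename variable:

```python
import csv

# Hypothetical example file standing in for the question's CSV.
with open("urls.csv", "w", encoding="utf-8") as f:
    f.write("id,a,b,c,d,url\n")
    f.write("1,x,x,x,x,https://example.com/page1\n")
    f.write("2,x,x,x,x,https://example.com/page2\n")

urls = []
with open("urls.csv", newline="", encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)                 # skip the header row
    for row in reader:
        urls.append(row[5])      # 6th column -> index 5

print(urls)  # ['https://example.com/page1', 'https://example.com/page2']
```

Each collected url could then be fetched exactly as in the question, with requests.get(url, headers=http_headers, timeout=5) inside the existing try/except blocks.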


If the URLs do not appear in a fixed column or row of the CSV, you can simply use a regex and read the file line by line, like this:

import re
import requests

filename = 'shitty_url.csv'
with open(filename, 'r') as csvfile:
    for line in csvfile:
        # raw string; the pattern expects a space after the URL
        url_pattern = re.search(r'https://(.+?) ', line)
        if url_pattern:
            found_url = url_pattern.group(1)
            url = 'https://%s' % found_url
            crawler = requests.get(url, timeout=5)

Hope this helps :)


The URLs start on the second row of the CSV. How can I skip the header row? @lifeicomplex
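One way to skip the header row with the line-by-line regex approach is to advance the file iterator once before the loop. A sketch, using a hypothetical urls.csv whose URLs are followed by a space so that the answer's regex matches:

```python
import re

# Hypothetical example file with a header row on the first line.
with open("urls.csv", "w", encoding="utf-8") as f:
    f.write("id,url \n")
    f.write("1,https://example.com/a \n")
    f.write("2,https://example.com/b \n")

found = []
with open("urls.csv", "r", encoding="utf-8") as csvfile:
    next(csvfile)  # consume the header line before matching
    for line in csvfile:
        m = re.search(r'https://(.+?) ', line)
        if m:
            found.append('https://%s' % m.group(1))

print(found)  # ['https://example.com/a', 'https://example.com/b']
```

Each URL in found could then be fetched with requests.get(url, timeout=5) as in the answer above.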