Python: Why doesn't my MySQL database contain all the data Scrapy sends to it?
Tags: python, mysql, scrapy, transactions, bulk-load

I'm new to SQL and am trying to store a large amount of data in a MySQL database from Python. For some reason, after sending about 24,000 rows of data to my database, I find that it contains only about 1,300 rows.

My hard drive is not full. I get no errors while pushing the data (I use Python for this). It could be related to the InnoDB storage engine, but I doubt it, given that those 1,300 rows take up only 176 KB. I can't rule that last point out, though, because the documentation talks about size limits in bytes and pages rather than row counts, which I can't relate to my situation.

The statements I use when working with the database from Python look like this.

Database creation:
"CREATE DATABASE database_name"

"CREATE TABLE table_name ( \
    id INT PRIMARY KEY, \
    price INT, \
    model VARCHAR(40), \
    year INT, \
    body VARCHAR(30), \
    milage INT, \
    engine_size FLOAT, \
    engine_power INT, \
    transmission VARCHAR(10), \
    fuel_type VARCHAR(30), \
    owners INT, \
    ultra_low_emission_zone INT, \
    service_history VARCHAR(30), \
    first_year_road_tax INT, \
    full_manufacturer_warranty INT \
);"
Row insertion:

"INSERT INTO table_name VALUES \
    (" + str(id_carrier.carried_id + counter) + ", \
    " + price + ", \
    '" + model + "', \
    " + year + ", \
    '" + body + "', \
    " + milage + ", \
    " + engine_size + ", \
    " + engine_power + ", \
    '" + transmission + "', \
    '" + fuel_type + "', \
    " + owners + ", \
    " + ultra_low_emission_zone + ", \
    '" + service_history + "', \
    " + first_year_road_tax_included + ", \
    " + manufacturer_warranty + " \
);"
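As an aside on the pattern above: building an INSERT by string concatenation is fragile. A model name containing a quote breaks the statement, and the approach invites SQL injection. DB-API drivers accept the values separately from the SQL text via placeholders. A minimal self-contained sketch of the placeholder pattern, using Python's built-in sqlite3 so it runs without a server (mysql-connector-python works the same way, but uses %s placeholders instead of ?; the table and values here are illustrative):

```python
import sqlite3

# In-memory database stands in for MySQL; the placeholder idea is identical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, price INTEGER, model TEXT)")

row = (1, 24500, "320d m sport")
# Values are passed separately; the driver handles quoting and escaping.
cur.execute("INSERT INTO cars VALUES (?, ?, ?)", row)
conn.commit()

print(cur.execute("SELECT model FROM cars WHERE id = 1").fetchone()[0])  # 320d m sport
```

With mysql-connector-python the same call would be `cursor.execute("INSERT INTO cars VALUES (%s, %s, %s)", row)`.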
counter = 0
# retrieving offers
offers = response.xpath('//li[@class = "search-page__result"]')[1:-2]
for offer in offers:
    # reinitializing the data which can be missing
    year = '0'
    body = 'unlisted'
    milage = '1000000'
    engine_size = '50'
    engine_power = '1000000'
    transmission = 'unlisted'
    fuel_type = 'unlisted'
    owners = '100'
    ultra_low_emission_zone = '0'
    service_history = 'unlisted'
    first_year_road_tax_included = '0'
    manufacturer_warranty = '0'
    # getting price of offer
    price = Selector(text=offer.extract()).xpath('//div[@class = "product-card-pricing__price"]//span/text()').get()
    # formatting price of offer
    price = price.replace(',', '').replace('£', '')
    # getting offer model
    model = Selector(text=offer.extract()).xpath('//h3[@class = "product-card-details__title"]/text()').get()
    # formatting model
    model = model.replace('\n', '').replace('BMW ', '').strip().lower()
    # going through some clustered data and applying formatting
    clustered_details = Selector(text=offer.extract()).xpath('//li[@class = "atc-type-picanto--medium"]/text()').getall()
    for detail in clustered_details:
        if 'reg' in detail.lower():
            year = detail.split(' ')[0]
            continue
        elif detail.lower() in ('convertible', 'coupe', 'estate', 'hatchback',
                                'mpv', 'suv', 'saloon'):
            body = detail.lower()
            continue
        elif 'miles' in detail:
            milage = detail.lower().replace(',', '').replace(' miles', '')
            continue
        elif detail[0] in '0123456' and detail[1] == '.' and detail[2] in '0123456':
            engine_size = detail.lower().replace('l', '')
            continue
        elif detail[0].isnumeric() and detail[1].isnumeric() and 'p' in detail.lower():
            engine_power = first_number(detail)
            continue
        elif detail.lower() in ('manual', 'automatic'):
            transmission = detail.lower()
            continue
        elif detail.lower() in ('diesel', 'diesel hybrid', 'diesel plug-in hybrid',
                                'electric', 'petrol', 'petrol hybrid',
                                'petrol plug-in hybrid'):
            fuel_type = detail.lower()
            continue
        elif detail.lower() == 'full service history':
            service_history = 'full service history'
            continue
        elif detail.lower() in ('part non dealer', 'part service history'):
            service_history = 'part service history'
            continue
        elif detail.lower() in ('full dealership history', 'full dealer'):
            service_history = 'full dealership history'
            continue
        elif detail.lower() == 'ulez':
            ultra_low_emission_zone = '1'
            continue
        elif 'owner' in detail.lower():
            owners = detail.lower().split(' ')[0]
            continue
        elif detail.lower() == 'first year road tax included':
            first_year_road_tax_included = '1'
            continue
        elif detail.lower() == 'full manufacturer warranty':
            manufacturer_warranty = '1'
            continue
        else:
            print('Unexpected value ', detail)
            exit()
    counter += 1
    insert_query = "INSERT INTO " + make + " VALUES \
        (" + str(id_carrier.carried_id + counter) + ", \
        " + price + ", \
        '" + model + "', \
        " + year + ", \
        '" + body + "', \
        " + milage + ", \
        " + engine_size + ", \
        " + engine_power + ", \
        '" + transmission + "', \
        '" + fuel_type + "', \
        " + owners + ", \
        " + ultra_low_emission_zone + ", \
        '" + service_history + "', \
        " + first_year_road_tax_included + ", \
        " + manufacturer_warranty + " \
        );"
    db.execute_query(insert_query, connection)
    # print(price, model, year, body, milage, engine_size, engine_power, transmission,
    #       fuel_type, owners, ultra_low_emission_zone, service_history,
    #       first_year_road_tax_included, manufacturer_warranty)
id_carrier.carried_id = id_carrier.carried_id + counter
try:
    next_page = response.xpath('//a[@class = "pagination--right__active"]/@data-paginate')[0].root
except IndexError:
    print("All the pages have been scraped")
    exit()
url = ".." + next_page
time.sleep(3 + random.uniform(0, 4))
yield scrapy.Request(url=url, callback=self.parse, headers=header)
Edit 2: adding the database-management code for the functions that are not visible in the code from Edit 1.
import mysql.connector
from mysql.connector import Error
import pandas as pd

def create_server_connection(host_name, user_name, user_password, db_name=None):
    connection = None
    if db_name != None:
        try:
            connection = mysql.connector.connect(
                host=host_name,
                user=user_name,
                passwd=user_password,
                database=db_name
            )
            print("Connection to database " + db_name + " established.")
        except Error as err:
            if err.errno != 1049:
                print(f"Error: '{err}'")
                exit()
            # errno 1049: unknown database
            print("Requested database does not exist. Creating it.")
            connection = mysql.connector.connect(
                host=host_name,
                user=user_name,
                passwd=user_password
            )
            create_database_query = "CREATE DATABASE " + db_name
            create_database(create_database_query, connection)
            connection = mysql.connector.connect(
                host=host_name,
                user=user_name,
                passwd=user_password,
                database=db_name
            )
            print("Connection to database " + db_name + " established.")
    return connection

def create_database(query, connection):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        print("Database created successfully.")
    except Error as err:
        print(f"Error: '{err}'")

def execute_query(query, connection):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        connection.commit()
    except Error as err:
        if err.errno != 1050:
            print(f"Error: '{err}'")
            exit()
        # errno 1050: table already exists
        print("Table " + query.split('TABLE')[1][1:].split(' ')[0] + " exists. Continuing script.")
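One thing to note about execute_query above: it discards the cursor, so the caller never learns how many rows a statement actually affected. Returning cursor.rowcount (a standard DB-API attribute that mysql-connector-python also provides) makes silently-failing inserts visible. A self-contained sketch of that change, using sqlite3 in place of MySQL so it runs without a server:

```python
import sqlite3

def execute_query(query, connection):
    # Returns the number of rows the statement affected (-1 on some drivers
    # when the count is unknown), so callers can detect inserts that did nothing.
    cursor = connection.cursor()
    cursor.execute(query)
    connection.commit()
    return cursor.rowcount

connection = sqlite3.connect(":memory:")
execute_query("CREATE TABLE cars (id INTEGER PRIMARY KEY)", connection)
affected = execute_query("INSERT INTO cars VALUES (1)", connection)
print(affected)  # 1
```

A caller can then log or abort whenever the returned count is less than 1.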
I'm guessing (but I'm not sure) that you're trying to do one gigundo INSERT with all 24k rows in its VALUES list.

MySQL has a (long) limit on statement length. Usually it's not a problem, but it may have truncated a very large insert.

Try breaking it up into 100-row chunks.

Edit: thanks for clarifying your data flow. MySQL's Python connector does not autocommit INSERTs and UPDATEs.

That means the MySQL server accumulates your changes into a single transaction. 24K rows is a lot of rows for one transaction, and it may overrun the transaction buffer space.

So, after every 100 or so inserted rows, you should connection.commit().

You could also set cnx.autocommit = True when you set up the connection. But bulk loading by committing rows one by one is very slow.

The Python connector is unlike most other languages' connectors in that it does not autocommit by default. It's confusing.
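The chunk-and-commit advice above can be sketched as follows. sqlite3 stands in for MySQL here so the example is self-contained, but the pattern is identical with a mysql-connector-python cursor (the table and row values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, price INTEGER)")

rows = [(i, 10000 + i) for i in range(24000)]  # stand-in scraped data
BATCH = 100
for start in range(0, len(rows), BATCH):
    # one multi-row insert plus one commit per ~100 rows keeps each
    # transaction small without paying the cost of a commit per row
    cur.executemany("INSERT INTO cars VALUES (?, ?)", rows[start:start + BATCH])
    conn.commit()

print(cur.execute("SELECT COUNT(*) FROM cars").fetchone()[0])  # 24000
```

executemany also sidesteps the statement-length concern, since no single statement carries all 24k rows.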
Accepted answer (from the asker): I would like to thank Chris Schaller and O. Jones for their help. With their guidance, I managed to debug the problem down to the Scrapy web crawler: as detailed in the documentation and a StackOverflow post, there is a maximum of 100 operations per response.

Comments:

"How do you know that 24,000 rows were sent? If the insert does no existence checks, either the loop logic is not sending all the rows, or an optimistic bulk insert is failing rows on a unique or other constraint. Show us the code that processes the file."

"I have posted the Python code that processes it. You are right that I don't know whether the data is sent and stored as I believe; I will modify my code to build a .CSV file to verify this. The data is entered row by row."

"Check the response of db.execute_query(insert_query, connection). I'm not familiar with the library that has this method, but SQL execute methods usually return the number of affected rows. Look for a row count of zero or less than 1 (some implementations return -1 to indicate an error), then investigate the input SQL and parameters further. Otherwise, look at the data that did get inserted: is it always the first rows, or is there some other pattern to the successful or excluded rows? It could be your pagination method; 1300 is a suspiciously round number: how many records are on one page, and how many pages make 1300 records? This may have nothing to do with SQL at all. If everything else checks out, @o-jones's answer sounds like the most likely culprit."