Ruby threads & Mutex: why can't my code fetch JSON in order?
I wrote a crawler that downloads JSON from the Internet using 8 threads:
#encoding: utf-8
require 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'
$mutex = Mutex.new # Lock of database and $cnt
$cntMutex = Mutex.new # Lock of $threadCnt
$threadCnt = 0 # number of running threads
$cnt = 0 # number of lines in this COMMIT to database
db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000
def fetch(http, url, timeout = 10)
  # ...
end

def parsePrice( i, db)
  ss = fetch(Net::HTTP.start('p.3.cn',80), 'http://p.3.cn/prices/get?skuid=J_'+i.to_s)
  doc = JSON.parse(ss)[0]
  puts "processing "+i.to_s
  STDOUT.flush
  begin
    $mutex.synchronize {
      $cnt = $cnt+1
      db.execute("insert into prices (id, price) VALUES (?,?)", [i,doc["p"].to_f])
      if $cnt > 20
        db.execute('COMMIT')
        db.execute('BEGIN')
        $cnt = 0
      end
    }
  rescue SQLite3::ConstraintException
    warn("duplicate id: "+i.to_s)
    $cntMutex.synchronize {
      $threadCnt -= 1;
    }
    Thread.terminate
  rescue NoMethodError
    warn("Matching failed")
  rescue
    raise
  ensure
  end
  $cntMutex.synchronize {
    $threadCnt -= 1;
  }
end

puts "will now start from " + start.to_s()
db.execute("BEGIN")
Thread.new {
  for ii in start..12000000 do
    sleep 0.1 while $threadCnt > 7
    $cntMutex.synchronize {
      $threadCnt += 1;
    }
    Thread.new {
      parsePrice( ii, db)
    }
  end
  db.execute('COMMIT')
} . join
Then I created a database named price.db:
sqlite3 > create table prices (id INT PRIMARY KEY, price REAL);
To make my code thread-safe, db, $cnt, and $threadCnt are all protected by $mutex or $cntMutex.
However, when I tried to run this script, the following messages were printed:
[lz@lz crawl]$ ruby priceCrawler.rb
will now start from 10000000
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000002
processing 10000002
processing 10000002processing 10000008processing 10000008processing 10000002
duplicate id: 10000002
duplicate id: 10000002processing 10000008
processing 10000008duplicate id: 10000008
duplicate id: 10000008processing 10000008
duplicate id: 10000008
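The script seems to skip some ids and to call parsePrice several times with the same id.
So why does this error occur? Any help would be appreciated.

It seems to me that your thread scheduling is wrong. I have modified your code to illustrate some of the possible race conditions you are triggering.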
require 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'
$mutex = Mutex.new # Lock of database and $cnt
$cntMutex = Mutex.new # Lock of $threadCnt
$threadCnt = 0 # number of running threads
$cnt = 0 # number of lines in this COMMIT to database
db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000

def fetch(http, url, timeout = 10)
  # ...
end

def parsePrice(i, db)
  must_terminate = false
  ss = fetch(Net::HTTP.start('p.3.cn',80), "http://p.3.cn/prices/get?skuid=J_#{i}")
  doc = JSON.parse(ss)[0]
  puts "processing #{i}"
  STDOUT.flush
  begin
    $mutex.synchronize {
      $cnt = $cnt+1
      db.execute("insert into prices (id, price) VALUES (?,?)", [i,doc["p"].to_f])
      if $cnt > 20
        db.execute('COMMIT')
        db.execute('BEGIN')
        $cnt = 0
      end
    }
  rescue SQLite3::ConstraintException
    warn("duplicate id: #{i}")
    must_terminate = true
  rescue NoMethodError
    warn("Matching failed")
  rescue
    # Raising here does not prevent ensure from running.
    # The exception propagates after $threadCnt is decremented
    # in the ensure clause.
    raise
  ensure
    $cntMutex.synchronize {
      $threadCnt -= 1;
    }
  end
  # Thread.exit ends the current thread; Thread has no class-level terminate.
  Thread.exit if must_terminate
end

puts "will now start from #{start}"
# This BEGIN makes no sense to me.
db.execute("BEGIN")
for ii in start..12000000 do
  should_redo = false
  # Instead of sleeping, we acquire the lock and check
  # if we can create another thread. If we can't, we just
  # release the lock and retry later (using for-redo).
  $cntMutex.synchronize {
    if $threadCnt <= 7
      $threadCnt += 1;
      # Pass ii as a thread argument so each thread gets its own copy;
      # a for loop reuses one ii variable across all iterations.
      Thread.new(ii) { |i| parsePrice(i, db) }
    else
      # We use this flag since we don't know for sure redo's
      # behavior inside a lock.
      should_redo = true
    end
  }
  # Will redo this iteration if we can't create the thread.
  if should_redo
    # Mitigate busy waiting a bit.
    sleep(0.1)
    redo
  end
end
# This COMMIT makes no sense to me.
db.execute('COMMIT')
# Join every thread except the current one; a thread cannot join itself.
(Thread.list - [Thread.current]).each { |t| t.join }
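
One likely contributor to the repeated ids in the output above is that a Ruby for loop does not create a fresh variable per iteration, so a block passed to Thread.new may read the loop variable after it has already advanced. The sketch below isolates that effect; the method names and the gate queue are hypothetical, introduced only for this demonstration, and it does not touch the network or the database.

```ruby
require 'thread'

# Hypothetical demo, not part of the original script.
# Each thread waits on `gate` so that it reads the loop variable only
# after the for loop has finished.
def ids_seen_with_shared_variable
  gate = Queue.new
  seen = Queue.new
  threads = []
  for ii in 1..3 do
    # The block closes over the single shared variable ii.
    threads << Thread.new { gate.pop; seen << ii }
  end
  3.times { gate << :go }
  threads.each(&:join)
  Array.new(seen.size) { seen.pop }.sort
end

def ids_seen_with_thread_argument
  gate = Queue.new
  seen = Queue.new
  threads = []
  for ii in 1..3 do
    # Thread.new copies ii into the block parameter i at creation time.
    threads << Thread.new(ii) { |i| gate.pop; seen << i }
  end
  3.times { gate << :go }
  threads.each(&:join)
  Array.new(seen.size) { seen.pop }.sort
end

p ids_seen_with_shared_variable   # every thread sees the final value of ii
p ids_seen_with_thread_argument   # every thread sees its own id
```

Iterating with (1..3).each { |ii| ... } would also give each iteration its own variable, since block parameters are local to the block.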
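
As an alternative to counting running threads by hand, a fixed pool of workers can pull ids from a Queue, which removes the need for $threadCnt and its lock. This is only a sketch under assumptions: process_ids and the :done marker are hypothetical names, and the block passed in stands in for the real fetch-and-insert step.

```ruby
require 'thread'

# Hypothetical helper: runs the given block for every id on a fixed
# number of worker threads and returns the collected results.
def process_ids(ids, worker_count = 8, &work)
  jobs = Queue.new
  ids.each { |id| jobs << id }
  # One stop marker per worker so every thread eventually exits.
  worker_count.times { jobs << :done }

  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      while (id = jobs.pop) != :done
        results << work.call(id)
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

# Stand-in work: the real script would fetch and store the price here.
pairs = process_ids(10000000..10000009) { |id| [id, id.to_s.length] }
puts pairs.size
```

Each id is popped by exactly one worker, so no id is processed twice or skipped, although the order in which results arrive is not guaranteed.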