Ruby threads & mutex: why does my code fail to fetch JSON in order?

I wrote a crawler that uses 8 threads to download JSON from the Internet:

#encoding: utf-8
require 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'

$mutex = Mutex.new # Lock of database and $cnt
$cntMutex = Mutex.new # Lock of $threadCnt
$threadCnt = 0 # number of running threads 
$cnt = 0 # number of lines in this COMMIT to database

db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000    
def fetch(http, url, timeout = 10) 
    # ...
end

def parsePrice( i, db)
        ss = fetch(Net::HTTP.start('p.3.cn',80), 'http://p.3.cn/prices/get?skuid=J_'+i.to_s)
        doc = JSON.parse(ss)[0]
        puts "processing "+i.to_s
        STDOUT.flush
        begin
                $mutex.synchronize {
                        $cnt = $cnt+1
                        db.execute("insert into prices (id, price) VALUES (?,?)", [i,doc["p"].to_f])
                        if $cnt > 20
                                db.execute('COMMIT')
                                db.execute('BEGIN')
                                $cnt = 0
                        end
                }
        rescue SQLite3::ConstraintException
                warn("duplicate id: "+i.to_s)
                $cntMutex.synchronize {
                        $threadCnt -= 1;
                }
                Thread.terminate
        rescue NoMethodError
                warn("Matching failed")
        rescue
                raise
        ensure
        end

        $cntMutex.synchronize {
                $threadCnt -= 1;
        }
end



puts "will now start from " + start.to_s()
db.execute("BEGIN")

Thread.new {
        for ii in start..12000000 do

                sleep 0.1 while $threadCnt > 7

                $cntMutex.synchronize {
                        $threadCnt += 1;
                }
                Thread.new { 
                        parsePrice( ii, db)
                }



        end
        db.execute('COMMIT')
} . join

Then I created a database named price.db:

sqlite3 > create table prices (id INT PRIMARY KEY, price REAL);

To make my code thread-safe, db, $cnt, and $threadCnt are each protected by either $mutex or $cntMutex.

However, when I ran the script, it printed the following:

[lz@lz crawl]$ ruby priceCrawler.rb 
will now start from 10000000
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008



http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000002

processing 10000002
processing 10000002processing 10000008processing 10000008processing 10000002

duplicate id: 10000002

duplicate id: 10000002processing 10000008
processing 10000008duplicate id: 10000008


duplicate id: 10000008processing 10000008
duplicate id: 10000008

The script seems to skip some ids and to call parsePrice several times with the same id. Why does this happen? Any help would be appreciated.
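The skipped and repeated ids described above are consistent with how Ruby scopes a `for` loop variable: the loop body does not get a fresh variable per iteration, so every block created inside the loop closes over the same one. The sketch below is a minimal illustration and is not taken from the original post; the names `shared`, `separate`, and `threads` are made up for the example. It shows the effect with plain procs first, then shows that passing the value as an argument to `Thread.new` gives each thread its own copy.

```ruby
# A `for` loop does not open a new scope, so all blocks created in its
# body share the single loop variable `i`.
shared = []
for i in 1..3 do
  shared << proc { i }
end
# By the time the procs run, the loop has finished and i holds its last value.
shared_values = shared.map(&:call)

# A block parameter is local to each block call, so each proc keeps its own j.
separate = (1..3).map { |j| proc { j } }
separate_values = separate.map(&:call)

# Thread.new hands its arguments to the block, so each thread receives the
# value k had at the moment the thread was created.
threads = []
for k in 1..3 do
  threads << Thread.new(k) { |n| n }
end
thread_values = threads.map(&:value)

p shared_values    # every entry is the final value of i
p separate_values  # one distinct value per proc
p thread_values    # one distinct value per thread
```

In the crawler, a thread that starts running after the loop has already advanced `ii` would read the newer value, which would explain both the missing and the duplicated ids.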

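It looks to me like your thread scheduling is wrong. I have modified your code to illustrate some of the possible race conditions you were triggering.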

require 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'

$mutex = Mutex.new # Lock of database and $cnt
$cntMutex = Mutex.new # Lock of $threadCnt
$threadCnt = 0 # number of running threads 
$cnt = 0 # number of lines in this COMMIT to database

db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000    
def fetch(http, url, timeout = 10) 
  # ...
end

def parsePrice(i, db)
  must_terminate = false

  ss = fetch(Net::HTTP.start('p.3.cn',80), "http://p.3.cn/prices/get?skuid=J_#{i}")
  doc = JSON.parse(ss)[0]
  puts "processing #{i}"
  STDOUT.flush
  begin
    $mutex.synchronize {
      $cnt = $cnt+1
      db.execute("insert into prices (id, price) VALUES (?,?)", [i,doc["p"].to_f])
      if $cnt > 20
        db.execute('COMMIT')
        db.execute('BEGIN')
        $cnt = 0
      end
    }
  rescue SQLite3::ConstraintException
    warn("duplicate id: #{i}")
    must_terminate = true
  rescue NoMethodError
    warn("Matching failed")
  rescue
    # Raising here does not prevent ensure from running.
    # It will raise after we decrement $threadCnt on
    # ensure clause.
    raise
  ensure
    $cntMutex.synchronize {
      $threadCnt -= 1;
    }
  end

  # Thread.exit terminates the current thread.
  Thread.exit if must_terminate
end

puts "will now start from #{start}"

# This BEGIN makes no sense to me.
db.execute("BEGIN")

for ii in start..12000000 do
  should_redo = false

  # Instead of sleeping, we acquire the lock and check
  # whether we can create another thread. If we can't, we just
  # release the lock and retry later (using redo inside the for loop).
  $cntMutex.synchronize{
    if $threadCnt <= 7
      $threadCnt += 1;
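      # Pass ii as a thread argument: the for loop variable is shared
      # by all iterations, so the block must not read it directly.
      Thread.new(ii) { |i| parsePrice(i, db) }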
    else
      # We use this flag since we don't know for sure redo's
      # behavior inside a lock.
      should_redo = true
    end

  }

  # Will redo this iteration if we can't create the thread.
  if should_redo
    # Mitigate busy waiting a bit.
    sleep(0.1)
    redo
  end
end

# This commit makes no sense to me.
db.execute('COMMIT')

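# Wait for the workers; a thread cannot join itself, so skip the current one.
Thread.list.each { |t| t.join unless t == Thread.current }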
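As an alternative to counting running threads by hand, a fixed pool of workers can pull ids from a `Queue`, which is thread-safe and hands each item to exactly one consumer. The following is only a sketch under stated assumptions: `process` is a hypothetical stand-in for the real fetch-and-insert step, and `run_pool`, `WORKER_COUNT`, and the `:done` stop marker are names invented for this example.

```ruby
require 'thread' # provides Queue on older Rubies; harmless on newer ones

WORKER_COUNT = 8

# Hypothetical placeholder for the real work (HTTP fetch + database insert).
def process(id)
  id * 2
end

def run_pool(ids, worker_count = WORKER_COUNT)
  jobs = Queue.new
  ids.each { |id| jobs << id }
  worker_count.times { jobs << :done } # one stop marker per worker

  results = []
  results_lock = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      # Queue#pop gives each id to exactly one worker, so no id is
      # processed twice and none is skipped.
      while (id = jobs.pop) != :done
        value = process(id)
        # The results array is shared, so appends are serialized.
        results_lock.synchronize { results << [id, value] }
      end
    end
  end

  workers.each(&:join)
  results
end

results = run_pool(1..20)
puts "processed #{results.size} ids"
```

With this layout there is no thread counter to protect and no polling loop. For a range as large as the one in the question, a `SizedQueue` filled by a producer thread would keep memory bounded, at the cost of a little more code.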
