Ruby on Rails: Using Ruby regex for Gutenberg extraction

Tags: ruby-on-rails, ruby, regex, seed

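I'm new to Ruby and I'm struggling to use regex to seed a database from the following text file: http://www.gutenberg.org/cache/epub/673/pg673.txt

I'd like the <h1> tags to be the words for my dictionary database, and the <def> tags to be the definitions.

I could be way off base here (so far I've only ever seeded a database by copy and paste ;):

require 'open-uri'
Dictionary.delete_all
g_text = open('http://www.gutenberg.org/cache/epub/673/pg673.txt')
y = g_text.read(/<h1>(.*)<\/h1>/)
a = g_text.read(/<def>(.*)<\/def>/)
Dictionary.create!(:word => y, :definition => a)

As you can see, each <h1> usually has more than one <def>, which is fine, since I can add columns to the table for definition1, definition2, etc.

But what would the regex look like to make sure each definition ends up in the same row as the preceding <h1> tag?

Thanks for any help!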

Edit:

Okay, so this is what I'm trying now:

doc.scan(Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/)).map do |m, n|
  p [m,n]
end
How do I get rid of all the nil entries?
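
On the nil entries: Regexp.union joins the two patterns with alternation, so every match fills only one of the two capture groups and scan returns nil for the other. Below is a minimal sketch of one way to handle that, assuming each <h1> and <def> sits on a single line as in the snippet above. The helper name group_definitions and the sample string are made up for illustration and are not part of the original thread.

```ruby
# Hypothetical helper: groups each <def> under the <h1> that precedes it.
def group_definitions(text)
  pattern = Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/)
  entries = {}
  current = nil
  # Each match yields [word, definition]; exactly one of the two is nil.
  text.scan(pattern) do |word, definition|
    if word
      current = word
      entries[current] ||= []
    elsif current
      # Definitions seen before any <h1> are skipped.
      entries[current] << definition
    end
  end
  entries
end

# Made-up sample in the same shape as the Gutenberg markup.
sample = "<h1>Aband</h1>\n<def>To abandon.</def>\n<def>To banish; to expel.</def>"

pairs = sample.scan(Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/))
# pairs => [["Aband", nil], [nil, "To abandon."], [nil, "To banish; to expel."]]

grouped = group_definitions(sample)
# grouped => {"Aband"=>["To abandon.", "To banish; to expel."]}
```

Branching on which group matched, rather than calling compact on each pair, keeps track of whether a string was a word or a definition.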

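Comments:

In general, regex is not the right tool for parsing HTML; there are libraries for parsing DOM documents. In your case in particular, finding a correct regex will be a lot of trouble.

Obligatory Tony the Pony reference.

Should I use Nokogiri? Thanks for the reply, Tensibai, but in this text file the def tags are not nested inside the h1 branch. Maybe I'm too much of a newbie for SO... if so, please tell me where else to ask.

@Tensibai: The document is neither an HTML document nor a well-formed XML document. A DOM parser is completely unsuitable for a file of this size, because it has to load the whole file and build the DOM tree before you can use it. The only possible way to use a parser would be an XML pull parser that can continue after encountering an error (can Nokogiri::XML::Reader do that? It would have to be tested). The markup and syntax are very basic here, so using regex on a file stream is far from a bad idea in this case.

Answer:

It seems regex is the only way to avoid stopping partway through the document when an error is encountered... at least after a few attempts with other parsers.

What I ended up with (using a local extract for sandbox use):
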
require 'pp' # pretty-prints the hash at the end

h1regex = /h1>(.+)<\/h1/i    # headword regex (.+ skips empty tags)
defregex = /def>(.+)<\/def/i # definition regex (.+ skips empty tags)

# Initialize vars
defhash = {}
key = nil

File.open("./gut.txt") do |f|
  f.each_line do |l|
    newkey = l[h1regex, 1] # headword on this line, or nil
    if newkey
      key = newkey         # update the current key
      defhash[key] ||= []  # the file repeats some h1 entries with further defs, so keep any existing array
    end
    definition = l[defregex, 1]
    # We matched a def: add it to the current key's array (defs seen before any h1 are skipped)
    defhash[key] << definition if definition && key
  end
end

pp defhash # print the result
{"A"=>
  [" The first letter of the English and of many other alphabets. The capital A of the alphabets of Middle and Western Europe, as also the small letter (a), besides the forms in Italic, black letter, etc., are all descended from the old Latin A, which was borrowed from the Greek <spn>Alpha</spn>, of the same form; and this was made from the first letter (<i>Aleph</i>, and itself from the Egyptian origin. The <i>Aleph</i> was a consonant letter, with a guttural breath sound that was not an element of Greek articulation; and the Greeks took it to represent their vowel <i>Alpha</i> with the \\'84 sound, the Ph\\'d2nician alphabet having no vowel symbols.",
   "The name of the sixth tone in the model major scale (that in C), or the first tone of the minor scale, which is named after it the scale in A minor. The second string of the violin is tuned to the A in the treble staff. -- A sharp (A#) is the name of a musical tone intermediate between A and B. -- A flat (A&flat;) is the name of a tone intermediate between A and G.",
   "In each; to or for each; <as>as, \"twenty leagues <ex>a</ex> day\", \"a hundred pounds <ex>a</ex> year\", \"a dollar <ex>a</ex> yard\", etc.</as>",
   "In; on; at; by.",
   "In process of; in the act of; into; to; -- used with verbal substantives in <i>-ing</i> which begin with a consonant. This is a shortened form of the preposition <i>an</i> (which was used before the vowel sound); as in <i>a</i> hunting, <i>a</i> building, <i>a</i> begging. \"Jacob, when he was <i>a</i> dying\" <i>Heb. xi. 21</i>.  \"We'll <i>a</i> birding together.\" \" It was <i>a</i> doing.\" <i>Shak.</i>  \"He burst out <i>a</i> laughing.\" <i>Macaulay</i>.  The hyphen may be used to connect <i>a</i> with the verbal substantive (as, <i>a</i>-hunting, <i>a</i>-building) or the words may be written separately. This form of expression is now for the most part obsolete, the <i>a</i> being omitted and the verbal substantive treated as a participle.",
   "Of.",
   " A barbarous corruption of <i>have</i>, of <i>he</i>, and sometimes of <i>it</i> and of <i>they</i>."],
 "Abalone"=>
  ["A univalve mollusk of the genus <spn>Haliotis</spn>. The shell is lined with mother-of-pearl, and used for ornamental purposes; the sea-ear. Several large species are found on the coast of California, clinging closely to the rocks."],
 "Aband"=>["To abandon.", "To banish; to expel."],
 "Abandon"=>
  ["To cast or drive out; to banish; to expel; to reject.",
   "To give up absolutely; to forsake entirely ; to renounce utterly; to relinquish all connection with or concern on; to desert, as a person to whom one owes allegiance or fidelity; to quit; to surrender.",
   "Reflexively : To give (one's self) up without attempt at self-control ; to yield (one's self) unrestrainedly ; -- often in a bad sense.",
   "To relinquish all claim to; -- used when an insured person gives up to underwriters all claim to the property covered by a policy, which may remain after loss or damage by a peril insured against."]}