Ruby: the right way to upload documents to FSCrawler for indexing in Elasticsearch

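I'm prototyping a Rails application that uploads documents to FSCrawler (running its REST interface) so they can be incorporated into an Elasticsearch index. Following their example, this works: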

response = `curl -F "file=@#{params[:document][:upload].tempfile.path}" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"`
The file is uploaded and its content is indexed. This is an example of what I get:

"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...
 {"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}
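When I run curl on the command line, I get everything, including a correctly set "filename". If I use it as above, in a Rails controller, the filename is set to the Tempfile's file name, as you can see. That is not a workable solution. Trying params[:document][:upload].tempfile (without .path), or just params[:document][:upload], fails completely.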
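As a stopgap for the file name problem, curl's -F option accepts a ;filename= suffix that overrides the name sent in the multipart part. The sketch below builds the argument list in Ruby and passes it to curl without going through a shell. The helper name curl_upload_args is hypothetical, and I am assuming the upload path itself contains no characters that curl treats specially (such as ; or ,).

```ruby
require 'open3'

# Hypothetical helper: builds the argv for a curl multipart upload that sends
# `path` but reports `original_filename` as the part's file name.
def curl_upload_args(path, original_filename, url)
  # curl expects a quoted filename to escape double quotes and backslashes.
  escaped = original_filename.gsub(/["\\]/) { |c| "\\#{c}" }
  ['curl', '-sS', '-F', "file=@#{path};filename=\"#{escaped}\"", url]
end

# Usage inside the controller (assumes curl is installed on the host):
#   upload = params[:document][:upload]
#   args = curl_upload_args(upload.tempfile.path, upload.original_filename,
#                           'http://127.0.0.1:8080/fscrawler/_upload?debug=true')
#   response, status = Open3.capture2(*args)
```

Passing the arguments as an array avoids shell interpolation of the user-supplied file name, which the backtick form does not.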

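I'm trying to do this the "right way", but every attempt to use a proper HTTP client has failed. I can't figure out how to issue an HTTP POST that submits a file to FSCrawler the way curl (on the command line) does.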

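In this example, I'm just trying to send the file using the Tempfile file object. For some reason, FSCrawler gives me the error shown in the comments; it picks up some metadata, but doesn't index any content: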

## Failed to extract [100000] characters of text for ...
## org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
uri = URI("http://127.0.0.1:8080/fscrawler/_upload?debug=true")
request = Net::HTTP::Post.new(uri)
form_data = [['file', params[:document][:upload].tempfile,
  { filename: params[:document][:upload].original_filename,
  content_type: params[:document][:upload].content_type }]]
request.set_form form_data, 'multipart/form-data'
response = Net::HTTP.start(uri.hostname, uri.port) do |http|
  http.request(request)
end
If I change the above to use params[:document][:upload].tempfile.path, then I no longer get the InputStream error, but I (still) don't get any content indexed. This is an example of what I get:

"{\n \"ok\" : true,\n \"filename\" : \"RackMultipart20200130-91061-16swulg.pdf\",\n \"url\" : \"http://127.0.0.1:9200/local/_doc/d661edecf3e28572676e97a6f0d1d\",\n \"doc\" : {\n \"content\" : \"\\n \\n \\n\\nBasically, what you need to know is that Dante is all IP-based, and makes use of common IT standards. Each Dante device behaves \\n\\nmuch like any other network device you would already find on your network. \\n\\nIn order to make integration into an existing network easy, here are some of the things that Dante does: \\n\\n▪ Dante...
 {"_index":"local","_type":"_doc","_id":"72c9ecf2a83440994eb87d28786e6","_version":3,"_seq_no":26,"_primary_term":1,"found":true,"_source":{"content":"/var/folders/bn/pcc1h8p16tl534pw__fdz2sw0000gn/T/RackMultipart20200130-91061-134tcxn.pdf\n","meta":{},"file":{"extension":"pdf","content_type":"text/plain; charset=ISO-8859-1","indexing_date":"2020-01-30T15:33:45.481+0000","filename":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"},"path":{"virtual":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf","real":"Similarity in Postgres and Rails using Trigrams · pganalyze.pdf"}}}
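One reading of these two results, which I have not verified against FSCrawler itself: Net::HTTP's set_form copies an IO value starting at its current read position, and treats a String value as the literal field content. An IO that has already been read to the end would therefore send zero bytes, and passing .path would send the path text as the document, which matches the "content" value in the output above. The sketch below opens a fresh binary handle on the path instead. The helper name upload_file is hypothetical.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: posts one file as multipart/form-data, reading the
# bytes from disk rather than sending the path string as the field value.
def upload_file(uri, path, filename, content_type)
  File.open(path, 'rb') do |io| # fresh handle, positioned at byte 0
    request = Net::HTTP::Post.new(uri)
    request.set_form(
      [['file', io, { filename: filename, content_type: content_type }]],
      'multipart/form-data'
    )
    # The handle must stay open until the request has been sent.
    Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
  end
end

# Usage inside the controller:
#   upload = params[:document][:upload]
#   response = upload_file(URI('http://127.0.0.1:8080/fscrawler/_upload?debug=true'),
#                          upload.tempfile.path,
#                          upload.original_filename,
#                          upload.content_type)
```

Calling rewind on the existing Tempfile before set_form should have a similar effect, again on the assumption that the read position is what caused the empty upload.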
If I try RestClient and send the file by referencing the actual path of the Tempfile, I get this error message and nothing is indexed:

## Unsupported media type
response = RestClient.post 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  file: params[:document][:upload].tempfile.path,
  content_type: params[:document][:upload].content_type
If I try to .read() the file and submit the result, then I break the FSCrawler form:

## Internal server error
request = RestClient::Request.new(
  :method => :post,
  :url => 'http://127.0.0.1:8080/fscrawler/_upload?debug=true',
  :payload => {
    :multipart => true,
    :file => File.read(params[:document][:upload].tempfile),
    :content_type => params[:document][:upload].content_type
})
response = request.execute
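Clearly, I've been trying everything I can think of, but I can't replicate what curl does with any Ruby-based HTTP client I know of. I'm at a total loss as to how to get Ruby to submit data to FSCrawler so that the document content is indexed correctly. I've been at this for far longer than I care to admit. What am I missing here?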

I eventually gave Faraday a try and, building on those attempts, came up with the following:

connection = Faraday.new('http://127.0.0.1:8080') do |f|
  f.request :multipart
  f.request :url_encoded
  f.adapter :net_http
end
file = Faraday::UploadIO.new(
  params[:document][:upload].tempfile.path,
  params[:document][:upload].content_type,
  params[:document][:upload].original_filename
)
payload = { :file => file }
response = connection.post('/fscrawler/_upload', payload)
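Routing the requests through a debugging proxy let me inspect each attempt as I got closer and closer to the curl request. This code posts the request almost exactly as curl does. To route the call through the proxy, I just needed to add , proxy: 'http://localhost:8866' to the end of the connection setup.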