Ruby on Rails: "invalid byte sequence in UTF-8" when sanitizing WordPress export content


ruby-on-rails ruby utf-8


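I'm writing a script to import WordPress content into my Rails application. I need to remove all images from the post body. When I view a post, I get "invalid byte sequence in UTF-8".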
require 'action_view'
require 'nokogiri'
require 'sanitize'

namespace :wordpress do
  desc 'Import WordPress Posts'
  task import_posts: :environment do |_, _args|
    IMAGE_REGEX = /"([a-z\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))"/i
    user_id = User.first[:id]
    Blogit::Post.destroy_all
    File.open('lib/post.xml') do |file|
      items = Nokogiri::XML(file).xpath('//channel//item')
      items.each do |item|
        body = Sanitize.fragment(item.at_xpath('content:encoded').text).force_encoding('UTF-8')
               .encode('UTF-16', invalid: :replace, replace: '')
               .encode('UTF-8')

        begin
          post = Blogit::Post.create(
            title: item.at_xpath('wp:post_name').text.strip,
            body: body,
            blogger_id: user_id,
            bootsy_image_gallery: Bootsy::ImageGallery.create
          )
          images = item.at_xpath('content:encoded').text.scan(IMAGE_REGEX).map(&:first)
          post.save(validate: false)
          # post.update_column(:created_at, item.at_xpath('wp:post_date_gmt').text + ' +0000')
          # if images.any?
          #   images.each do |image|
          #     post.remote_feature_image_url = image.first
          #     post.bootsy_image_gallery.images << Bootsy::Image.create(remote_image_file_url: image.first)
          #   end
          #   post.save
          # end
        rescue StandardError => e
          puts "#{e}"
          next
        end
      end
    end
  end
end

Here is _post.html.slim:
= content_tag(:article, id: "blog_post_#{post.id}", class: "blog_post") do
  / Render the header for this blog post
  = render "blogit/posts/post_head", post: post

  / Render Post Image Slider
  / = render "blogit/posts/slider", images: post.bootsy_image_gallery.images if post.bootsy_image_gallery.images.any?

  / Render the body of this blog post (as Markdown)
  = render "blogit/posts/post_body", post: post

  / Render admin links to edit/delete this post
  = render "blogit/posts/post_links", post: post

  / Render info about the person who wrote this post
  = render "blogit/posts/blogger_information", post: post

  = render 'elements/tags', post: post

  / Render the no. of comments
  - if defined?(show_comments_count) and show_comments_count
    = render "blogit/posts/comments_count", post: post

Just call String#scrub (or #scrub!) to remove the illegal bytes. It is available in MRI 2.1.0 and later.
body = Sanitize.fragment(item.at_xpath('content:encoded').text).force_encoding('UTF-8').scrub

There is no need for the
.encode('UTF-16', invalid: :replace, replace: '').encode('UTF-8')

step. It is trying to do what scrub already does, so just use scrub.
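To make the comparison concrete, here is a minimal sketch of scrub next to the UTF-16 round trip from the question. The sample string is invented for illustration and contains one byte (0xFF) that can never occur in valid UTF-8. It is not taken from the actual WordPress export.

```ruby
# Simulate bytes read from a file and labelled as UTF-8.
# "\xC3\xA9" is a valid UTF-8 "é"; "\xFF" is never valid in UTF-8.
raw = "caf\xC3\xA9 \xFF menu".b.force_encoding(Encoding::UTF_8)

is_valid = raw.valid_encoding? # false, because of the 0xFF byte

# Default behaviour: each invalid byte becomes U+FFFD (the replacement character).
scrubbed = raw.scrub

# Passing a replacement string drops the bad bytes instead of marking them.
stripped = raw.scrub('')

# The round trip from the question, kept only for comparison.
round_trip = raw.encode('UTF-16', invalid: :replace, replace: '').encode('UTF-8')

puts scrubbed
puts stripped
puts round_trip
```

The round trip should end up with a valid string as well, but scrub says what it does and avoids transcoding the whole body twice.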

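This may or may not stop the exception, depending on where it is actually raised. You haven't given us a line number for the exception. You may need to scrub the other data you pull from the XML as well, such as titles and images.
scrub works by replacing every invalid byte with the Unicode replacement character (�). Whether that is the right fix, however, depends on what is going on with the source text and why it contains invalid UTF-8 bytes in the first place. If you only end up with a few � here and there, there were probably just a few bad bytes. If you find that all or many accented or non-ASCII characters have been replaced with �, then you need to find out why the encoding is broken and fix it properly.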
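The difference between "a few bad bytes" and "the wrong encoding" can be sketched as follows. The sample bytes and the assumption that the real encoding is ISO-8859-1 are made up for illustration; the actual export may use something else (Windows-1252 is another common candidate).

```ruby
# Bytes that are really ISO-8859-1 ("café crème") but labelled as UTF-8.
# This scenario is an assumption for the example, not taken from the question.
mislabelled = "caf\xE9 cr\xE8me".b.force_encoding(Encoding::UTF_8)

# scrub yields each invalid byte sequence to the block, so we can count them.
replaced = 0
scrubbed = mislabelled.scrub do |_bad_bytes|
  replaced += 1
  "\uFFFD"
end

# Every accented character was lost, which hints at a mislabelled encoding.
puts "#{replaced} replacement(s): #{scrubbed}"

# Proper fix under the assumed encoding: reinterpret the bytes, then transcode.
repaired = mislabelled.dup
                      .force_encoding(Encoding::ISO_8859_1)
                      .encode(Encoding::UTF_8)
puts repaired
```

If every accented character turns into �, as it does here, scrubbing only hides the problem; reinterpreting the bytes under the correct encoding keeps the text intact.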
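You say your error is raised at:
= content_tag(:article, id: "blog_post_#{post.id}", class: "blog_post") do

However, that line doesn't even appear in the source code you pasted above.
If that line really is raising the error, it means there are illegal bytes in post.id. That seems unlikely. But if it really is the case, you could get around the "illegal bytes" exception by scrubbing post.id: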
content_tag(:article, id: "blog_post_#{post.id.scrub}", class: "blog_post") do

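But that would probably just lead to further problems. If that really is what's happening, you first need to figure out why there are illegal bytes in post.id and fix the root cause.
However, I doubt it. I don't think you have accurately diagnosed which line raises the exception.
Good luck.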
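This doesn't work either: body = Sanitize.fragment(item.at_xpath('content:encoded').text).scrub or body = Sanitize.fragment(item.at_xpath('content:encoded').text.scrub)
You need to tell us exactly which line raises the exception. An exception comes with a stack trace showing which lines of your code and of the gems' code were involved when it was raised. That is what to look at when you start debugging: you first need to know exactly which line is causing it, not to mention which versions of Ruby and Rails you are using. Current Rubies, v2.0+, default to UTF-8. You also haven't shown us a minimal example of the incoming XML that triggers the problem. Without that, you leave us guessing and imagining, which does nobody any good.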
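As a rough sketch of how to capture that information, the rescue block in the rake task could print the exception class and the top of the backtrace instead of only the message. The describe_error helper below is hypothetical, and the failing regexp match is only a stand-in for whatever code actually raises in the app.

```ruby
# Hypothetical helper: format an exception with its class, message,
# and the first few backtrace frames.
def describe_error(error, frames: 5)
  lines = ["#{error.class}: #{error.message}"]
  (error.backtrace || []).first(frames).each { |frame| lines << "  from #{frame}" }
  lines.join("\n")
end

report = nil
begin
  # Stand-in failure: matching a regexp against a string with invalid
  # UTF-8 bytes raises an error about the byte sequence.
  "\xFF".b.force_encoding(Encoding::UTF_8) =~ /img/
rescue StandardError => e
  report = describe_error(e)
  puts report
end
```

The first "from" line that points into your own code, rather than into a gem, is usually the place to start.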