Custom dictionaries in quanteda



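I need to run LIWC (Linguistic Inquiry and Word Count), and I am using quanteda/quanteda.dictionaries. I need to "load" custom dictionaries: I saved the word lists as individual .txt files and "load" them with readLines (example for just one dictionary):

and I get the following error: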

Error in stri_replace_all_charclass(value, "\\p{Z}", concatenator) : 
  invalid UTF-8 byte sequence detected; perhaps you should try calling stri_enc_toutf8()

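Clearly, the problem is with my .txt files. I have quite a few dictionaries and load each of them from a file.

How can I fix this error? Specifying the encoding in readLines does not seem to help.

Here is the file

Update: the easiest way to solve this on a Mac is to open the .txt file in Word instead of TextEdit. Word offers encoding options when saving, which TextEdit does not by default.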

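OK, the problem is not the encoding, since everything in the file you linked can be encoded entirely in the lower 128 characters of ASCII. The problem is the whitespace produced by the blank lines. There are also some leading spaces that need to be removed. This is easy to do with some subsetting and a few stringi cleanup operations.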

library("quanteda")
## Package version: 1.3.14
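A minimal sketch of the cleanup described above, not necessarily the exact code from the answer. The file contents, the temporary path, and the dictionary key `autonomy` are assumptions made so the example is self-contained; in practice `path` would point to one of the saved word lists.

```r
library("quanteda")
library("stringi")

# Hypothetical stand-in for one word list: a small file with blank
# lines and stray leading/trailing spaces, like the problem file.
path <- tempfile(fileext = ".txt")
writeLines(c("autonomy", "", "  independent", "self-reliant ", ""), path)

words <- readLines(path, encoding = "UTF-8")

# Strip leading and trailing whitespace from every entry.
words <- stri_trim_both(words)

# Subset away the entries that are now empty strings.
words <- words[words != ""]

# Build the dictionary from the cleaned character vector.
EODic <- dictionary(list(autonomy = words))
```

The same two cleanup lines can be applied to each word list before it is added to the list passed to `dictionary()`.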
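Without a reproducible example of the file, it is impossible to know where the error comes from. But it sounds like your input word list is not encoded as UTF-8. The "encoding" argument of readLines() does not re-encode the file, it only tells R to treat the text as UTF-8. My suggestion is to open the file in a text editor and save it explicitly as UTF-8. Alternatively, provide a link to the file so that the problem can be reproduced.

Thanks Ken, I added a link there. I am on a Mac, and when I open and save the file in TextEdit it does not give me an encoding option.

Ken, thank you. It did work. Interestingly, saving in Word works as well.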
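The point about the `encoding` argument of readLines() can be checked with a small base R experiment. This is only a sketch: the Latin-1 source encoding and the sample word are assumptions chosen for illustration, and they are not the contents of the file from the question.

```r
# Hypothetical badly encoded word list: "café" written as Latin-1 bytes.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeBin(c(charToRaw("caf"), as.raw(0xE9), charToRaw("\n")), con)
close(con)

# encoding = "UTF-8" only marks the strings as UTF-8; the bytes on disk
# are read unchanged, so the result is not valid UTF-8.
marked <- readLines(path, encoding = "UTF-8")
validUTF8(marked)

# Converting explicitly from the (assumed) source encoding does re-encode.
raw_lines <- readLines(path)
converted <- iconv(raw_lines, from = "latin1", to = "UTF-8")
validUTF8(converted)
```

Running `validUTF8()` on a word list before passing it to `dictionary()` is a quick way to tell an encoding problem apart from a whitespace problem.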
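# Call from the question that triggers the error.
# EODic is the custom dictionary built from the .txt word lists (its construction is not shown).
library("quanteda.dictionaries")  # provides liwcalike()
txt <- c("12th Battalion Productions is producing a fully holographic feature length production. Presenting a 3D audio-visual projection without a single cast member present, to give the illusion of live stage performance.")

liwcalike(txt, EODic, what = "word")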
