Python 3.x: translate a column with multiple languages to English

python-3.x pandas translation google-translate

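I have a dataset with several comment columns written in multiple languages. I want to translate these columns into English and create new columns that hold the English translations.

Accountability_COMMENT is one such column; each row holds a comment, and the comments are in many different languages. I want to create a new column containing all of these comments translated into English.

I tried the following code: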

from googletrans import Translator
from textblob import TextBlob

translator = Translator()
data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(
    lambda x: TextBlob(x).translate(to='en'))

The error I get is:

TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>

My column has the correct object dtype.

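You most likely have some comments that contain only a float value, i.e. a decimal number (a missing comment stored as NaN is also a float). Even though pandas reports the column as dtype object, the individual cells keep their own Python types, so TextBlob still receives a float for those rows. This causes the error: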

TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>
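To check whether this is what is happening, one option is to count the Python types actually stored in the column. This is a minimal sketch, assuming a small made-up frame in place of the real data_merge; the names non_string_mask and type_counts are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for data_merge; the real frame comes from the question.
data_merge = pd.DataFrame(
    {"Accountability_COMMENT": ["Bonjour", 3.5, float("nan"), "Hola"]}
)

col = data_merge["Accountability_COMMENT"]

# True for every cell that is not a Python str (floats, NaN, ints, ...)
non_string_mask = ~col.map(lambda v: isinstance(v, str))

# Count the Python types actually stored in the object column
type_counts = col.map(lambda v: type(v).__name__).value_counts()

print(type_counts)
print(data_merge[non_string_mask])
```

Any row selected by non_string_mask is one that TextBlob(x) would reject with the TypeError above.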

Unfortunately, once such values are converted to strings, they may also lead to an error like this:

raise NotTranslated('Translation API returned the input string unchanged.')
textblob.exceptions.NotTranslated: Translation API returned the input string unchanged.

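This happens because when a number is translated, the translation is identical to the original, and TextBlob evidently does not like that.

To avoid this, you can catch the NotTranslated exception and return the untranslated TextBlob, like this: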

from textblob import TextBlob
from textblob.exceptions import NotTranslated    

def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input just return the TextBlob version of the input
        return TextBlob(str(x))

data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(translate_comment)
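Since the question asks for a new column rather than an overwritten one, a variation is to write the result under a different name and to leave non-string cells alone. The sketch below is only an illustration: translate_column, fake_translate and the column name Accountability_COMMENT_en are hypothetical names, and the translator is passed in as a plain function so the example runs without a network call. In real use that function could wrap the TextBlob call shown above; as far as I know, recent TextBlob releases dropped translate(), in which case another translation client would have to be plugged in instead.

```python
import pandas as pd

def translate_column(series, translate):
    """Return a new Series; only non-empty str cells are passed to `translate`."""
    def convert(value):
        if isinstance(value, str) and value.strip():
            return translate(value)
        return value  # leave NaN, numbers and blank strings as they are
    return series.map(convert)

# Stand-in translator for illustration only; no network call is made.
def fake_translate(text):
    return {"Bonjour": "Hello", "Hola": "Hello"}.get(text, text)

# Hypothetical stand-in for data_merge
data_merge = pd.DataFrame({"Accountability_COMMENT": ["Bonjour", 3.5, None, "Hola"]})

# The original column is kept; the translations go into a new column
data_merge["Accountability_COMMENT_en"] = translate_column(
    data_merge["Accountability_COMMENT"], fake_translate
)
print(data_merge)
```

Keeping the original column makes it possible to rerun the translation for rows that failed without losing the source text.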

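Edit: if you get an HTTP "Too Many Requests" error, it is probably because the Google Translate API has kicked you out. Instead of using apply, you can use a for loop with a short sleep between iterations, so that the translation requests are sent more slowly. In that case you need to import one more package, time, and replace the last line: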

from time import sleep
from textblob import TextBlob
from textblob.exceptions import NotTranslated    

def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input just return the TextBlob version of the input
        return TextBlob(str(x))

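# Column position, so each cell can be written with a single .iloc call
# (chained assignment through a column's .iloc may not update data_merge)
col = data_merge.columns.get_loc('Accountability_COMMENT')

for i in range(len(data_merge)):
    # Translate one comment at a time
    data_merge.iloc[i, col] = translate_comment(data_merge.iloc[i, col])

    # Sleep for a quarter of a second
    sleep(0.25)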

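You can then try different values for the sleep function. Of course, the longer the sleep, the slower the translation! Note: the sleep argument is in seconds.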
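A fixed pause may still be too short when the service keeps answering with 429. One common alternative, not part of the answer above, is to retry a failed call with an exponentially growing wait. The sketch below assumes a hypothetical helper named call_with_backoff; the flaky function only simulates a translator that fails twice, and the sleep function is injected so the demonstration does not actually wait.

```python
import time

def call_with_backoff(func, arg, retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(arg); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(retries):
        try:
            return func(arg)
        except retry_on:
            # Assumption: in real use, narrow retry_on to the HTTP error
            # class that carries the 429 status.
            if attempt == retries - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)

# Simulated translator that fails on its first two calls (illustration only)
calls = {"count": 0}

def flaky(text):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RuntimeError("simulated HTTP 429")
    return text.upper()

delays = []  # records the waits instead of really sleeping
result = call_with_backoff(flaky, "hola", sleep=delays.append)
print(result, delays)
```

In real use the default sleep=time.sleep would be kept, and func would be the translate_comment function from the answer.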

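You have already tried that code... do you get any error? Does the output look different from what you want? Also, it is not clear to me what your input dataframe looks like and what the output dataframe should look like. Is there one column with comments, or many columns with comments?

There is one column with around 3000 comments, all in different languages.. I need to translate them all into English and store them in a new column..

I tested your code on a test dataframe with only three rows of strings in different languages, and it works fine. What does the error message actually say? TypeError: The text argument passed to __init__(text) must be a string, not what? What comes after "not"?

TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>

This looks really sophisticated and it makes sense to me. But I still get an error: HTTPError: HTTP Error 429: Too Many Requests. I don't know how to solve this.

I edited my answer, I hope it helps. If you find my answer useful please upvote it, and if it solves your problem please accept it! :)

I still get the same error.. I tried it in a different cell and removed the content, then imported the time part in another cell. But I still get the same error. Once we get past this error, I have about 10 more columns in the same dataframe that need the same treatment.

If you want all the code to work, you need to solve two different problems at the same time. One is the wrong type in some cells, the other is the large number of requests that textblob makes to Google under the hood. See my updated code: you have to use both the function with the exception handling and the for loop with the sleep function. I don't think I can do more to help you; this should already point you in the right direction.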
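For the roughly ten further columns mentioned in the comments, one way to keep the number of requests down is to translate each distinct string only once and reuse the result across rows and columns. This is a rough sketch under stated assumptions: translate_columns, counting_translate, the "_en" suffix and the columns A and B are all hypothetical, and the translator is again a stand-in that makes no network call.

```python
import pandas as pd

def translate_columns(df, columns, translate, suffix="_en"):
    """Add one translated column per input column, calling `translate`
    once per distinct string."""
    cache = {}

    def lookup(value):
        if not isinstance(value, str):
            return value  # floats, NaN, etc. are passed through unchanged
        if value not in cache:
            cache[value] = translate(value)
        return cache[value]

    out = df.copy()
    for name in columns:
        out[name + suffix] = df[name].map(lookup)
    return out

# Stand-in translator that records how often it is called (illustration only)
seen = []

def counting_translate(text):
    seen.append(text)
    return text.upper()

frame = pd.DataFrame({"A": ["Hola", "Hola", 1.5], "B": ["Hola", "Ciao", "Ciao"]})
translated = translate_columns(frame, ["A", "B"], counting_translate)
print(translated)
print(len(seen))
```

Here the stand-in is called only for the two distinct strings, even though five string cells are processed.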
