Python: I am trying to perform an outer join

python pandas

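I am trying to perform an outer join, and I get an error at the step commented "Use an outer join to get comments from users who have not posted about anorexia/obesity". I also tried using .set_index in the join, but then it gives me an error on the following line: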

neither_df = neither_df[neither_df['author_right'].isnull()]

Full code:

import hashlib

import pandas as pd
from tqdm import tqdm

csv_filename = 'full_data_preprocessed.csv'
chunksize = 10000
count = 0
obesity_author_data_frames = []
anorexia_author_data_frames = []
neither_author_data_frames = []

anorexia_record_count = 0
obesity_record_count = 0
neither_record_count = 0

for chunk in tqdm(pd.read_csv(csv_filename, chunksize=chunksize)):
    chunk['author'] = chunk['author'].apply(lambda a : hashlib.md5(a.encode()).hexdigest())
    anorexia_df = anorexia_authors.join(chunk.set_index('author'), on='author', how='inner', lsuffix='_left', rsuffix='_right')
    if anorexia_record_count < 10000 and not anorexia_df.empty:
        anorexia_author_data_frames.append(anorexia_df)
        anorexia_record_count += len(anorexia_df)

    obesity_df = obesity_authors.join(chunk.set_index('author'), on='author', how='inner', lsuffix='_left', rsuffix='_right')
    if obesity_record_count < 10000 and not obesity_df.empty:
        obesity_author_data_frames.append(obesity_df)
        obesity_record_count += len(obesity_df)

    # Use an outer join to get comments from users who have not posted about anorexia/obesity.
    neither_df = chunk.join(both_authors, on='author', how='outer', lsuffix='_left', rsuffix='_right')
    neither_df = neither_df[neither_df['author_right'].isnull()]
    if neither_record_count < 10000 and not neither_df.empty:
        neither_author_data_frames.append(neither_df)
        neither_record_count += len(neither_df)

    count += 1
    if anorexia_record_count >= 10000 and obesity_record_count >= 10000 and neither_record_count >= 10000:
        break

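The error means that the column you are trying to use as the join key (author) has a different data type in each table: in the first one (chunk) it is a string, and in the second one (both_authors) it is an int. You should convert the type of the first DataFrame's column in one of the following ways: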

chunk['author']=chunk['author'].astype(int)

or:

chunk['author'] = chunk.author.astype(int)

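Comment from the asker: When I use that, it also raises an error: ValueError: invalid literal for int() with base 10: '49d264a69d92ec57c908cdb64cb30931'. I also tried both_authors.set_index('author') for the join, and it raises KeyError ... author_right. Then I tried merge, and it also raises KeyError ... author_right.

That makes sense, because this data is not int as I originally thought, but string. So you need to convert the int type to string instead. Try this: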

both_authors['author'] = both_authors.author.astype(str)
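One possible reading of both errors, offered as a sketch rather than a confirmed diagnosis: `DataFrame.join(other, on='author')` matches the caller's `author` column against the index of `other`, not against a column of `other`. The frames below are tiny hypothetical stand-ins for `chunk` and `both_authors`; they only illustrate how to inspect the two sides of the join.

```python
import pandas as pd

# Hypothetical stand-ins for one chunk and for both_authors.
chunk = pd.DataFrame({"author": ["aa11", "bb22"], "body": ["x", "y"]})
both_authors = pd.DataFrame({"author": ["aa11"]})

# join(on="author") compares chunk["author"] with both_authors.index.
# Here the left key holds strings while the right index is a default
# integer index, which is the kind of mismatch the error complains about.
left_is_int = pd.api.types.is_integer_dtype(chunk["author"])
right_index_is_int = pd.api.types.is_integer_dtype(both_authors.index)
print("left key is int:", left_is_int)
print("right index is int:", right_index_is_int)

# After set_index("author") the key dtypes agree, but the frame has no
# columns left. Suffixes are only applied to overlapping column names,
# so no "author_right" column can appear in the joined result.
indexed = both_authors.set_index("author")
remaining_columns = list(indexed.columns)
print("columns after set_index:", remaining_columns)
```

If this reading is right, casting `author` to another type does not help, because the hashed author values are already strings on both sides; what differs is which part of `both_authors` the join looks at.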

I am already converting both chunk and both_authors to hex digests:

import hashlib

anorexia_authors = anorexiaSubreddits.drop_duplicates(subset="author")['author'].apply(lambda a: hashlib.md5(a.encode()).hexdigest()).to_frame()
obesity_authors = obesitySubreddits.drop_duplicates(subset="author")['author'].apply(lambda a: hashlib.md5(a.encode()).hexdigest()).to_frame()
both_authors = bothSubreddits.drop_duplicates(subset="author")['author'].apply(lambda a: hashlib.md5(a.encode()).hexdigest()).to_frame()
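Since both sides hold hashed author strings, the goal of keeping only comments from authors who are not in both_authors can also be reached without relying on an `author_right` column. The sketch below is one way to do it, not necessarily what the asker ended up using: a left merge with `indicator=True`, plus an `isin` filter as a shorter alternative. The sample frames and the helper name `comments_from_other_authors` are hypothetical.

```python
import hashlib

import pandas as pd


def md5_hex(name: str) -> str:
    """Hash an author name the same way the question does."""
    return hashlib.md5(name.encode()).hexdigest()


# Hypothetical stand-ins for both_authors and for one chunk of the CSV.
both_authors = pd.DataFrame({"author": ["alice", "bob"]})
both_authors["author"] = both_authors["author"].apply(md5_hex)

chunk = pd.DataFrame(
    {"author": ["alice", "carol", "dave", "bob"], "body": ["a1", "c1", "d1", "b1"]}
)
chunk["author"] = chunk["author"].apply(md5_hex)


def comments_from_other_authors(chunk_df, known_authors):
    """Keep rows of chunk_df whose author is absent from known_authors."""
    keys = known_authors[["author"]].drop_duplicates()
    # indicator=True adds a "_merge" column that says where each row matched.
    merged = chunk_df.merge(keys, on="author", how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


neither_df = comments_from_other_authors(chunk, both_authors)

# Shorter alternative: filter on membership instead of joining.
neither_via_isin = chunk[~chunk["author"].isin(both_authors["author"])]

print(neither_df["body"].tolist())
print(neither_via_isin["body"].tolist())
```

Inside the chunk loop, either `neither_df` variant could take the place of the outer join followed by the `isnull()` filter.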
