Python: reading multiple txt files into a dict and a DataFrame

python pandas dataframe nlp

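I am trying to load multiple txt files into a DataFrame. I know how to load URLs, CSV, and Excel, but I could not find any reference on how to load multiple txt files into a DataFrame and match them with a dictionary, or vice versa.

The text files are not comma- or tab-delimited; they are just plain text containing song lyrics.

I have checked the pandas documentation. Any help is welcome.

Ideally, the DataFrame I hope to end up with would look like this example: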

                 |                                                        lyrics
    -------------+-----------------------------------------------------------------------------------------
    bonjovi      |    some text from the text files HiHello! WelcomeThank you Thank you for coming.
    -------------+---------------------------------------------------------------------------------------
    lukebryan    |    some other text from the text files.Hi.Hello WelcomeThank you Thank you for coming. 
    -------------+-----------------------------------------------------------------------------------------
    johnprine    |    yet some text from the text files. Hi.Hello WelcomeThank you Thank you for coming.

Basic example folder structure: lyrics/

    urls = [
    'lyrics/bonjovi.txt',
    'lyrics/lukebryan.txt',
    'lyrics/johnprine.txt',
    'lyrics/brunomars.txt',
    'lyrics/methodman.txt',
    'lyrics/bobmarley.txt',
    'lyrics/nickcannon.txt',
    'lyrics/weeknd.txt',
    'lyrics/dojacat.txt',
    'lyrics/ladygaga.txt',
    'lyrics/dualipa.txt',
    'lyrics/justinbieber.txt',]

Musician names

Open the text files. The files are located in the directory from which I run the Jupyter notebook.

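import pickle

# Mode "wb" with pickle.dump writes lyrics[i] into lyrics/<band>.txt as a pickle;
# `bands` and `lyrics` must already be defined for this loop to run.
for i, c in enumerate(bands):
    with open("lyrics/" + c + ".txt", "wb") as file:
        pickle.dump(lyrics[i], file)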

Double-check to make sure the data has been loaded correctly. I am hoping to get a result like this:

dict_keys(['bonjovi', 'lukebryan', 'johnprine', 'brunomars', 'methodman', 'bobmarley', 'nickcannon', 'weeknd', 'dojacat', 'ladygaga', 'dualipa', 'justinbieber'])

# We are going to change this to key: artist, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text


# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

We can either keep it in dictionary format or put it into a DataFrame:

import pandas as pd

pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['lyrics']
data_df = data_df.sort_index()
data_df

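Here is how I would do it. Note that I generalized the file handling, so I don't have to worry about building the list of keys by hand and making sure everything matches.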

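Works fine, and pulling the names from the txt files is a better solution too! Thanks for the concise comments, they helped me understand your code. Thanks again for your time.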

import os
import re
import pandas as pd

#get full path of txt file
filePath = []
for file in os.listdir("./lyrics"):
    filePath.append(os.path.join("./lyrics", file))

#pull file name from the path with regex, capturing the text between the last path separator (\ or /) and .txt
fileName = re.compile(r'[\\/]([^\\/]*)\.txt$')

#make empty dict Data with the key as the file name, and the value as the words in the file.
data = {}
for file in filePath:
    #capturing file name
    key = fileName.search(file)
    with open(file, "r") as readFile:
        # note that key[1] is the capture group from our search, and that the text is put into a list.
        data[key[1]] = [readFile.read()]

#make dataframe from dict, and rename columns.
df = pd.DataFrame(data).T.reset_index().rename(columns = {'index':'bands', 0:'lyrics'})
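
As a variation on the answer above, here is a minimal sketch that uses `pathlib` instead of a regex, so the key comes from `Path.stem` regardless of the path separator. The function names `load_lyrics` and `lyrics_frame`, the default `lyrics` folder, and the UTF-8 encoding are assumptions made for illustration; they are not part of the original answer.

```python
from pathlib import Path

import pandas as pd


def load_lyrics(folder="lyrics"):
    """Read every .txt file in `folder` into a {file stem: text} dict."""
    data = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Path.stem is the file name without directory or extension,
        # e.g. "lyrics/bonjovi.txt" -> "bonjovi".
        # encoding="utf-8" is an assumption about how the files were saved.
        data[path.stem] = path.read_text(encoding="utf-8")
    return data


def lyrics_frame(data):
    """Build a one-column DataFrame ('lyrics') indexed by artist name."""
    df = pd.Series(data, name="lyrics", dtype="object").to_frame()
    df.index.name = "bands"
    return df.sort_index()
```

Calling `lyrics_frame(load_lyrics("lyrics"))` should give one row per text file, with the artist name as the index and the whole file content in the `lyrics` column. Files that do not end in `.txt` are skipped by the glob pattern.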