Python on Hadoop: reading blocks


I have the following problem. I want to extract data from HDFS (a table called "complaint"). I wrote the following script, which actually works:

import pandas as pd
from hdfs import InsecureClient
import os

file = open ("test.txt", "wb")

print ("Step 1")
client_hdfs = InsecureClient ('http://XYZ')
N = 10
print ("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)
    print('end')
file.close()
My problem now is that the folder "complaint" contains 4 files (I don't know which file type they are) and the read operation returns bytes that I cannot use any further (I saved them to a text file as a test, and it looks like this:

In HDFS it looks like this:

My question now is: is it possible to separate the data of each column in a meaningful way?

I have only found solutions for .csv files and the like, and somehow got stuck here... :-)
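
One way to find out what those 4 files actually are (a minimal sketch, not part of the original post, reusing the InsecureClient and the truncated path from the script above) is to list the directory and look at the first few bytes of each file: Parquet files start with PAR1 and ORC files start with ORC, so the magic bytes usually reveal which storage format the Hive table uses.

from hdfs import InsecureClient

client_hdfs = InsecureClient ('http://XYZ')

table_dir = '/user/.../complaint'   # truncated path exactly as in the question
for name in client_hdfs.list(table_dir):
    with client_hdfs.read(table_dir + '/' + name) as reader:
        magic = reader.read(4)
    # Parquet files begin with b'PAR1', ORC files begin with b'ORC';
    # anything else is likely plain text or another Hive storage format
    print(name, magic)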

EDIT: I made changes to my solution and tried different approaches, but none of them really works. Here is the updated code:

import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive


#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient ('http://some-adress:50070')
insec_client_tms2 = InsecureClient ('http://some-adress:50070')

#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)

#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)

#connection via HDFS3 (not working)
#The module couldn't be load
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)

#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')

#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open ("extraction.txt", "w")


#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)

#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)

#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
#    df = pd.read_parquet(f)


#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print(features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')
    print("saving data to file")
    file.write(str(features))  # write the parsed result; the 'data' variable from the commented-out decode is never defined
    print('end')

file.close()
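
Since the files apparently belong to a Hive table, one route that avoids touching the raw block files is to let Spark read the table through the Hive metastore. A minimal sketch, assuming a local Spark installation with Hive support and a correctly set JAVA_HOME (the "Java gateway process exited" error above usually points to a missing or misconfigured Java/Spark setup); the table name XYZ.complaint is taken from the discussion below:

from pyspark.sql import SparkSession

# enableHiveSupport() needs access to the cluster's Hive metastore,
# e.g. via a hive-site.xml in Spark's conf directory
spark = (SparkSession.builder
         .appName('TMS')
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql('SELECT * FROM XYZ.complaint LIMIT 10')
df.show()

# hand over to pandas for further local processing
pandas_df = df.toPandas()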

You cannot read the file blocks as plain text. Your files are owned by the hive group, so are they part of a Hive table?
Our provider doesn't really know, but I assume so. At least I can say that (when browsing the directories) it looks like this: /user/hive/XYZ.db/complaint
Can you connect to Hive and run SHOW CREATE TABLE XYZ.complaint?
We have no access to the nodes via PuTTY or anything like that, I just looked around a bit in the web interface. Let's assume it really is a Hive table (that makes the most sense to me); what would be the solution to my problem?
You can use pyhive or pyspark to connect to Hive, or some other JDBC/ODBC tool; you don't need an SSH session. My point here anyway is that if the data is in Hive, it is not necessarily plain text that Python can read. It could be ORC, Parquet, or something else.
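
For the pyhive route suggested here, a minimal sketch (assuming HiveServer2 is reachable and the sasl/thrift-sasl packages are installed; note that the attempt in the question pointed hive.Connection at the NameNode with an hdfs:// scheme and port 8020, while HiveServer2 normally listens on port 10000 and takes a plain host name). Host, database and table names are placeholders based on the question:

import pandas as pd
from pyhive import hive

# connect to HiveServer2, not to the HDFS NameNode
conn = hive.Connection(host='some-adress', port=10000, database='deltatest')

# Hive decodes the ORC/Parquet files, so the result already comes back as columns
df = pd.read_sql('SELECT * FROM basedeviation LIMIT 100', conn)
print(df.head())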