Amazon Web Services: Can't get the correct format in Athena when reading a CSV file

Tags: amazon-web-services, pyspark, aws-glue, amazon-athena

I have a CSV file in S3 that I am trying to read from Athena using Glue. On my computer it looks like this:

+-------------------+-------------+------------+-------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-----------+--------+---------+----+-----+
|tweet_id           |ticker_symbol|company_name|writer       |post_date |body                                                                                                                                                                                                                                                               |comment_num|retweet_num|like_num|post_time|year|month|
+-------------------+-------------+------------+-------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-----------+--------+---------+----+-----+
|1024364727789600768|TSLA         |Tesla Inc   |evacuationboy|2018-08-01|This kinda stuff never worked to sell cars, most of all super heavy no off road ones - $tsla                                                                                                                                                                       |0          |0          |1       |00:12:37 |2018|8    |
|1024392683119423488|TSLA         |Tesla Inc   |sbalatan     |2018-08-01|WHOA.Beside exceptionally detailed support for his already public Whistleblower claims, Martin Tripp also accuses $TSLA of serious accounting fraud.  The whole counterclaim must be read by any stakeholder of Tesla.Each paragraph is more damning than the next.|0          |0          |0       |02:03:42 |2018|8    |
|1024397232391553025|AAPL         |apple       |j_p_jacques  |2018-08-01|$AAPL 40% earning growth17% revenue growthat it trade for less than 10 2018 earning adjusted for net cashat that growth thy can return $900B to shareholder in less than 6 years                                                                                   |0          |2          |3       |02:21:47 |2018|8    |
|1024398885329010688|AAPL         |apple       |HaiderSF     |2018-08-01|Timely on $AAPL marketing from Ken Segall:                                                                                                                                                                                                                         |0          |0          |4       |02:28:21 |2018|8    |
|1024402095540260871|AAPL         |apple       |SignoreRomeo |2018-08-01|In Q3, $AAPL pay completed more total transactions than $SQ & more mobile transactions than $PYPL                                                                                                                                                                  |0          |1          |1       |02:41:06 |2018|8    |
+-------------------+-------------+------------+-------------+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-----------+--------+---------+----+-----+
I have also configured the following options on the Glue catalog table:

Serde serialization lib: org.apache.hadoop.hive.serde2.OpenCSVSerde
Serde parameters: escapeChar='\\' quoteChar='"' separatorChar=','
The problem is that columns containing commas are still split into separate columns, as shown here:

+------------------+-------------+------------+---------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------+-----------------------------------+----+-----+
|tweet_id          |ticker_symbol|company_name|writer         |post_date |body                                                                                                                                                                                                              |comment_num                                                                                                                                                  |retweet_num  |like_num  |post_time                          |year|month|
+------------------+-------------+------------+---------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------+-----------------------------------+----+-----+
|638421173571854340|AAPL         |apple       |plutonetworks  |2015-09-01|"""\""Move to android fast. ""\""@CNBCnow: Apple"""""Cisco announce partnership where Cisco will optimize networks for iOS users. • $AAPL $CSCO""\""\"""""0"1"00:10:43"                                           |2015                                                                                                                                                         |09           |null      |null                               |null|null |
|638435638623211520|AAPL         |apple       |_CardiacKid    |2015-09-01|$AAPL - Correction to Former Apple Supplier GT Advanced to Cut 40% of Workforce Article http://ih.advfn.com/p.php?pid=nmona&article=68350537&xref=newsalerttweet&adw=1126416…                                     |0                                                                                                                                                            |0            |0         |01:08:12                           |2015|09   |
|638447486319759360|GOOG         |Google Inc  |drwendellcraig_|2015-09-01|"""Sanofi buys into Google's #biotech  future"                                                                                                                                                                    | pairing up in #diabetes http://fiercebiotech.com/story/sanofi-buys-googles-biotech-future-pairing-diabetes/2015-08-31?utm_campaign=SocialMedia… $sny $goog""|0            |0         |0                                  |2015|09   |
|638598799171166208|AAPL         |apple       |PortfolioBuzz  |2015-09-01|Screen through high rated articles at once for US Tech Kings $AAPL $GOOG $FB http://cityfalcon.com/watchlists?name=US%20Tech%20Giants…                                                                            |0                                                                                                                                                            |0            |0         |11:56:32                           |2015|09   |
|638496755907194880|GOOG         |Google Inc  |ADVFNplc       |2015-09-01|$GOOG - India's Google Investigation Moves Forward http://uk.advfn.com/news/DJN/2015/article/68352601?xref=newsalerttweet&adw=1126416…                                                                            |0                                                                                                                                                            |0            |0         |05:11:03                           |2015|09   |
|638646145787559936|AAPL         |apple       |ArjunKharpal   |2015-09-01|Apple apparently wanting to do a Netflix and make original programming:  $AAPL                                                                                                               |1                                                                                                                                                            |0            |0         |15:04:41                           |2015|09   |
|638503830179704832|AAPL         |apple       |MarketSmith    |2015-09-01|Apple trails Fitbit in its first full quarter on the wearables market: https://washingtonpost.com/news/the-switch/wp/2015/08/28/apple-trails-fitbit-in-its-first-full-quarter-on-the-wearables-market/… $AAPL $FIT|0                                                                                                                                                            |1            |2         |05:39:10                           |2015|09   |
|638649417390784513|AAPL         |apple       |JohnyTradr     |2015-09-01|How Does Newmont Measure Up To Its Peers? http://snip.ly/JmG7 http://investwall.com #stocks #trading #investing $FB $AAPL                                                                                         |0                                                                                                                                                            |0            |0         |15:17:41                           |2015|09   |
|638591792024317952|"""AAPL"     | GOOG       | MSFT""        |"""apple" | Google Inc                                                                                                                                                                                                       | Microsoft""                                                                                                                                                 |PortfolioBuzz|2015-09-01|"""Stay ahead with Nasdaq 100 news"|2015|09   |
|638666551546220544|AAPL         |apple       |MoneyMarketzz  |2015-09-01|We Sent A SECRET New Penny Stock Pick To Our Platinum Members! Get Exclusive Special Access:  $DO $ALL $AAPL                                                                                    |0                                                                                                                                                            |0            |0         |16:25:46                           |2015|09   |
+------------------+-------------+------------+---------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------+-----------------------------------+----+-----+
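The columns are split at every comma. I can't figure out what I am doing wrong. I also searched Stack Overflow, but nothing helped.

Thanks in advance.

Sample of the CSV file contents: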

tweet_id,ticker_symbol,company_name,writer,post_date,body,comment_num,retweet_num,like_num,post_time
550501828547190784,AMZN,Amazon.com,Scott_Klemke,2015-01-01,"\"\"\"@WSJ: Jeff Bezos lost $7.4 billion in Amazon's worst year since 2008: http://on.wsj.com/1BmKuz3 $AMZN\"\"\"",0,0,0,09:30:38
550589778908164098,TSLA,Tesla Inc,Gold_prediction,2015-01-01,Stock Market Outlook: An Average Return Of 26.51% In A Year  #aggressives #risk #stocks $TSLA $GILD $RAD,0,0,0,15:20:07
550510764042121217,AAPL,apple,smoran26,2015-01-01,*FEATURE PRESENTATION* - THE MASSACRE: PART ONE - Stock of the Year - $AAPL Computer #Vegas #NYE  http://stephenjohnmoran.com/a-writers-diary/feature-presentation-the-massacre-part-one-stock-of-the-year-aapl-computer-vegas-nye… via @weebly,0,0,0,10:06:08
550670427388121088,MSFT,Microsoft,stockwire24,2015-01-01,Microsoft Corporation Is About to Abandon Internet Explorer $MSFT ,0,0,0,20:40:35
550612060946829312,AMZN,Amazon.com,caroltheva,2015-01-01,Free guide to understanding risk graphs $AMZN $FB $TWTR,0,0,0,16:48:39
550888743654400000,AAPL,apple,pricatti,2015-01-02,"#Balances #resultados #cartera #portfolio$aapl 26/01 EPS 2,56$king 02/02 EPS 0,37 $baba 16/02 $swhc 02/03 $csiq 03/03",0,0,1,11:08:06
550652617341566976,"AMZN, AAPL","Amazon.com, apple",SentiQuant,2015-01-01,#TOPTICKERTWEETS $IMRS $AAPL $SPY $IGN $FB $A $BABA $WAG $AMZN $QUAD #sentiquant 20150101 09:00:07:309,0,0,0,19:29:49
550927186169827330,MSFT,Microsoft,investingjungle,2015-01-02,"$IBM Possible inverse H&S forming on the daily, bullish with a close above 50sma towards gap resis $qqq $spy $MSFT  http://stks.co/t1Dwi",0,0,0,13:40:51
550693344372723712,AAPL,apple,MacHashNews,2015-01-01,Find the masks of power and master the elements in Lego Bionicle Mask of…  #AppAdvice $AAPL,0,0,1,22:11:39
551005616328949760,AAPL,apple,louwhiteman,2015-01-02,"Seeing fewer $AAPL products at my local $SBUX. Not sure if that is a #tech trend, or sign my neighborhood is on decline.",0,0,0,18:52:30
550781053137616896,AAPL,apple,CNBC,2015-01-02,This is Wall Street's top pick in 2015. Hint: it's NOT $AAPL or $GOOGL » http://cnb.cx/1xsBWIT,5,37,22,04:00:10
551016874700734464,AAPL,apple,aje0420,2015-01-02,Top ideas going into 2015 $AAPL $TWTR $VZ $C $UPS $YHOO #GLTA,0,0,0,19:37:14
550782974263037952,AAPL,apple,ramlousl,2015-01-02,"\"\"\"@CNBC: This is Wall Street's top pick in 2015. Hint: it's NOT $AAPL or $GOOGL » http://cnb.cx/1xsBWIT \"\"\"",0,0,0,04:07:48
551039971986264065,AAPL,apple,yonsu18,2015-01-02,"$AAPL last 5 days down on low volume, about 17 mil traded in the first hour, keep thinking funds getting out no bids taken today, all sell",0,0,0,21:09:01

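Frankly, I often find it easier to let AWS Glue determine the table format than to configure it manually. Here are the steps I use:

  • Make sure the files are in a directory of their own, because Glue assumes that all files in a directory share the same data format (more or less)
  • In the AWS Glue console, add a new crawler
  • Add information about the crawler: give it a name
  • Specify the crawler source type: use the defaults
  • Add a data store: provide the "include path" that points to the data, e.g.
    s3://bucket-name/folder/
  • Add another data store:
  • Choose an IAM role: create an IAM role and give it a name (you can delete the role later)
  • Create a schedule for this crawler: run on demand
  • Configure the crawler's output: select the database in which the table should be created
  • Finish
Apart from the name, the input path, and the IAM role, the settings are mostly left at their defaults.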


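You can then run the crawler. After about a minute it will create a table definition in the AWS Glue Data Catalog, and that table should also appear in Amazon Athena. (If it does not appear in Athena after the crawler finishes, see the relevant AWS documentation.)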
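
For readers who prefer scripting to the console, the steps above can be sketched with boto3. This is only a minimal sketch under assumptions: the crawler name, role, database, and bucket path used in the usage note are hypothetical placeholders, and AWS credentials plus an existing IAM role with Glue and S3 permissions are assumed. Leaving out a schedule means the crawler only runs when started, which corresponds to the "run on demand" choice.

```python
def build_crawler_request(name, role, database, s3_path):
    """Build the keyword arguments for Glue's create_crawler call.

    No "Schedule" key is set, so the crawler runs only on demand.
    """
    return {
        "Name": name,
        "Role": role,  # IAM role name or ARN (assumed to exist already)
        "DatabaseName": database,  # Glue database that will hold the table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start it once (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

A call might look like `create_and_start_crawler(build_crawler_request("tweets-crawler", "GlueCrawlerRole", "tweets_db", "s3://bucket-name/folder/"))`, where every argument is a placeholder to replace with your own values.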

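Can you remove the serde parameters and let Athena use the defaults?

Can you show us a sample of the actual CSV file? I presume it does not contain the table formatting (+-+-+-+-)?

@prabhakarredy, I tried that too, but to no avail.

@JohnRotenstein, I added a screenshot and pasted the CSV file contents. I am not sure whether I can post files here. If this is not enough, please let me know and I will try to upload the file to a Google Drive folder or somewhere else.

Thanks, I did try that, but I got the same result as above. I tried setting the serialization library and the serde parameters to fix it, but the result was the same.

I changed the delimiter to "|" when writing the CSV file, and now everything works. I am still learning, so I am not sure how common it is in industry to change the delimiter. Can you tell me whether this is acceptable?

The problem with the CSV format is that commas often appear in text. This is fixed by enclosing values in quotes. That in turn causes problems when the text also contains quotes, which is why the \" sequences appear in your sample file. Using a pipe (|) is a good way around this: it is rare in text, so it seldom clashes with the data itself. Pipe-delimited files are common in Hadoop-based pipelines for this very reason. Bottom line: if it works for you, then use it!

Great! Thanks for your help!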
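
The quoting problem described in the last comments can be reproduced locally with Python's standard csv module. The sketch below takes two rows from the sample file, parses them with a double-quote quote character and a backslash escape character, and then rewrites them with a pipe delimiter. The function names are illustrative only, and the csv module is not guaranteed to behave exactly like Athena's OpenCSVSerde, so treat this as a sanity check on the file rather than a reproduction of the Athena behavior.

```python
import csv
import io

# Header and two rows copied from the sample file in the question.
SAMPLE = r'''tweet_id,ticker_symbol,company_name,writer,post_date,body,comment_num,retweet_num,like_num,post_time
550501828547190784,AMZN,Amazon.com,Scott_Klemke,2015-01-01,"\"\"\"@WSJ: Jeff Bezos lost $7.4 billion in Amazon's worst year since 2008: http://on.wsj.com/1BmKuz3 $AMZN\"\"\"",0,0,0,09:30:38
551005616328949760,AAPL,apple,louwhiteman,2015-01-02,"Seeing fewer $AAPL products at my local $SBUX. Not sure if that is a #tech trend, or sign my neighborhood is on decline.",0,0,0,18:52:30
'''


def parse_rows(text):
    """Parse comma-separated text with quoted fields and backslash-escaped quotes."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        escapechar="\\",
        doublequote=False,  # quotes are escaped as \" rather than doubled
    )
    return list(reader)


def to_pipe_line(fields):
    """Join fields with '|', refusing values that would break a plain split."""
    for value in fields:
        if "|" in value or "\n" in value or "\r" in value:
            raise ValueError("field contains a pipe or line break: %r" % value)
    return "|".join(fields)


rows = parse_rows(SAMPLE)
pipe_lines = [to_pipe_line(row) for row in rows]
```

Each parsed row has ten fields, and once the rows are pipe-delimited a plain `line.split("|")` recovers the same fields without any quoting rules. The check in `to_pipe_line` matters: a pipe is rare in tweet text, but not impossible.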