Hadoop 无法使用Flume检索Twitter流数据

Hadoop 无法使用Flume检索Twitter流数据,hadoop,hdfs,flume,flume-ng,flume-twitter,Hadoop,Hdfs,Flume,Flume Ng,Flume Twitter,我正在尝试使用Flume流式传输和检索Twitter数据,但由于某种错误而无法这样做。 当我尝试使用命令执行它时: flume-ng agent -n TwitterAgent -c conf -f /home/hadoop/Flume/conf/twitter.conf 我得到以下信息: Info: Including Hadoop libraries found via (/home/hadoop/hadoop-2.10.1/bin/hadoop) for HDFS access Info

我正在尝试使用Flume流式传输和检索Twitter数据,但由于某种错误而无法这样做。 当我尝试使用命令执行它时:

flume-ng agent -n TwitterAgent -c conf -f /home/hadoop/Flume/conf/twitter.conf
我得到以下信息:

Info: Including Hadoop libraries found via (/home/hadoop/hadoop-2.10.1/bin/hadoop) for HDFS access
Info: Including HBASE libraries found via (/home/hadoop/hbase-2.2.5/bin/hbase) for HBASE access
Info: Including Hive libraries found via (/home/hadoop/apache-hive-2.3.7-bin) for Hive access
+ exec /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx20m -cp 'conf:/home/hadoop/Flume/lib/*:/home/hadoop/hadoop-2.10.1/etc/hadoop:/home/hadoop/hadoop-2.10.1/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/common/*:/home/hadoop/hadoop-2.10.1/share/hadoop/hdfs:/home/hadoop/hadoop-2.10.1/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.10.1/share/hadoop/yarn:/home/hadoop/hadoop-2.10.1/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/yarn/*:/home/hadoop/hadoop-2.10.1/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.10.1/contrib/capacity-scheduler/*.jar:/home/hadoop/hbase-2.2.5/conf:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/home/hadoop/hbase-2.2.5:/home/hadoop/hbase-2.2.5/lib/shaded-clients/hbase-shaded-client-byo-hadoop-2.2.5.jar:/home/hadoop/hbase-2.2.5/lib/client-facing-thirdparty/audience-annotations-0.5.0.jar:/home/hadoop/hbase-2.2.5/lib/client-facing-thirdparty/commons-logging-1.2.jar:/home/hadoop/hbase-2.2.5/lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar:/home/hadoop/hbase-2.2.5/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/home/hadoop/hbase-2.2.5/lib/client-facing-thirdparty/log4j-1.2.17.jar:/home/hadoop/hbase-2.2.5/lib/client-facing-thirdparty/slf4j-api-1.7.25.jar:/home/hadoop/hadoop-2.10.1/etc/hadoop:/home/hadoop/hadoop-2.10.1/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/common/*:/home/hadoop/hadoop-2.10.1/share/hadoop/hdfs:/home/hadoop/hadoop-2.10.1/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.10.1/share/hadoop/yarn:/home/hadoop/hadoop-2.10.1/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/yarn/*:/home/hadoop/hadoop-2.10.1/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.10.1/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.10.1/contrib/capacity-scheduler/*.jar:/home/hadoop/hbase-2.2.5/conf:/home/hadoop/apache-hive-2.3.7-bin/lib/*' -Djava.library.path=:/home/hadoop/hadoop-2.10.1/lib/native:/home/hadoop/hadoop-2.10.1/lib/native org.apache.flume.node.Application -n TwitterAgent -f /home/hadoop/Flume/conf/twitter.conf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/Flume/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.10.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.3.7-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/11/20 02:23:44 INFO node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
20/11/20 02:23:44 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:/home/hadoop/Flume/conf/twitter.conf
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:MemChannel
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:Twitter
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:Twitter
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:MemChannel
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:MemChannel
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Added sinks: HDFS Agent: TwitterAgent
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:Twitter
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:Twitter
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:Twitter
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:Twitter
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:Twitter
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Processing:HDFS
20/11/20 02:23:44 WARN conf.FlumeConfiguration: Agent configuration for 'TwitterAgent' has no configfilters.
20/11/20 02:23:44 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [TwitterAgent]
20/11/20 02:23:44 INFO node.AbstractConfigurationProvider: Creating channels
20/11/20 02:23:44 INFO channel.DefaultChannelFactory: Creating instance of channel MemChannel type memory
20/11/20 02:23:44 INFO node.AbstractConfigurationProvider: Created channel MemChannel
20/11/20 02:23:44 INFO source.DefaultSourceFactory: Creating instance of source Twitter, type org.apache.flume.source.twitter.TwitterSource
**20/11/20 02:23:44 ERROR node.AbstractConfigurationProvider: Source Twitter has been removed due to an error during configuration**
j**ava.lang.InstantiationException: Incompatible source and channel settings defined. source's batch size is greater than the channels transaction capacity. Source: Twitter, batch size = 1000, channel MemChannel, transaction capacity = 100**
    at org.apache.flume.node.AbstractConfigurationProvider.checkSourceChannelCompatibility(AbstractConfigurationProvider.java:386)
    at org.apache.flume.node.AbstractConfigurationProvider.getSourceChannels(AbstractConfigurationProvider.java:367)
    at org.apache.flume.node.AbstractConfigurationProvider.loadSources(AbstractConfigurationProvider.java:329)
    at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:105)
    at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
20/11/20 02:23:44 INFO sink.DefaultSinkFactory: Creating instance of sink: HDFS, type: hdfs
20/11/20 02:23:44 INFO node.AbstractConfigurationProvider: Channel MemChannel connected to [HDFS]
20/11/20 02:23:44 INFO node.Application: Starting new configuration:{ sourceRunners:{} sinkRunners:{HDFS=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@78e3d64e counterGroup:{ name:null counters:{} } }} channels:{MemChannel=org.apache.flume.channel.MemoryChannel{name: MemChannel}} }
20/11/20 02:23:44 INFO node.Application: Starting Channel MemChannel
20/11/20 02:23:44 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: MemChannel: Successfully registered new MBean.
20/11/20 02:23:44 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: MemChannel started
20/11/20 02:23:44 INFO node.Application: Starting Sink HDFS
20/11/20 02:23:44 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: HDFS: Successfully registered new MBean.
20/11/20 02:23:44 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started
终点站只是停留在这里,什么也没发生。我试着等了几分钟,但还是没变

我的配置文件twitter.conf位于/home/hadoop/Flume/conf,如下所示:

#Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

#Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey =##
TwitterAgent.sources.Twitter.consumerSecret =##
TwitterAgent.sources.Twitter.accessToken =##
TwitterAgent.sources.Twitter.accessTokenSecret =##
TwitterAgent.sources.Twitter.keywords =covid,covid-19,coronavirus

#Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

#Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 100

#Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
#Licensed to the Apache Software Foundation (ASF) under one
#or more contributor license agreements.  See the NOTICE file
#distributed with this work for additional information
#regarding copyright ownership.  The ASF licenses this file
#to you under the Apache License, Version 2.0 (the
#"License"); you may not use this file except in compliance
#with the License.  You may obtain a copy of the License at

#http://www.apache.org/licenses/LICENSE-2.0
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.

#If this file is placed at FLUME_CONF_DIR/flume-env.sh, it will be sourced
#during Flume startup.

#Enviroment variables can be set here.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=$CLASSPATH:/home/hadoop/Flume/lib/*
FLUME_CLASSPATH="/home/hadoop/Flume/lib/flume-sources-1.0-SNAPSHOT.jar"

#Give Flume more memory and pre-allocate, enable remote monitoring via JMX
#export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

#Let Flume write raw event data and configuration information to its log files for debugging
#purposes. Enabling these flags is not recommended in production,
#as it may result in logging sensitive user information or encryption secrets.
#export JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "

#Note that the Flume conf directory is always included in the classpath.
#FLUME_CLASSPATH=""
我的flume-env.sh文件如下所示:

#Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

#Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey =##
TwitterAgent.sources.Twitter.consumerSecret =##
TwitterAgent.sources.Twitter.accessToken =##
TwitterAgent.sources.Twitter.accessTokenSecret =##
TwitterAgent.sources.Twitter.keywords =covid,covid-19,coronavirus

#Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 10
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

#Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 100

#Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
#Licensed to the Apache Software Foundation (ASF) under one
#or more contributor license agreements.  See the NOTICE file
#distributed with this work for additional information
#regarding copyright ownership.  The ASF licenses this file
#to you under the Apache License, Version 2.0 (the
#"License"); you may not use this file except in compliance
#with the License.  You may obtain a copy of the License at

#http://www.apache.org/licenses/LICENSE-2.0
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.

#If this file is placed at FLUME_CONF_DIR/flume-env.sh, it will be sourced
#during Flume startup.

#Enviroment variables can be set here.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=$CLASSPATH:/home/hadoop/Flume/lib/*
FLUME_CLASSPATH="/home/hadoop/Flume/lib/flume-sources-1.0-SNAPSHOT.jar"

#Give Flume more memory and pre-allocate, enable remote monitoring via JMX
#export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

#Let Flume write raw event data and configuration information to its log files for debugging
#purposes. Enabling these flags is not recommended in production,
#as it may result in logging sensitive user information or encryption secrets.
#export JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "

#Note that the Flume conf directory is always included in the classpath.
#FLUME_CLASSPATH=""
错误是

j**ava.lang.InstantiationException: Incompatible source and channel settings defined. source's batch size is greater than the channels transaction capacity. Source: Twitter, batch size = 1000, channel MemChannel, transaction capacity = 100**
因此,您可以尝试减少源批次大小或增加通道容量以匹配源批次大小。

错误显示

j**ava.lang.InstantiationException: Incompatible source and channel settings defined. source's batch size is greater than the channels transaction capacity. Source: Twitter, batch size = 1000, channel MemChannel, transaction capacity = 100**

因此,您可以尝试减少源批次大小或增加通道容量以匹配源批次大小。

更新:显然,经过一些研究,我发现我使用了一个错误版本:flume-sources-1.0-SNAPSHOT.jar,它是flume的lib文件夹中的jar文件。通过以下方法生成我自己的jar修复了此问题:

更新:显然,经过一些研究,我发现我使用了一个糟糕的版本:flume-sources-1.0-SNAPSHOT.jar,它是在flume的lib文件夹中找到的jar文件。通过以下方法生成我自己的jar修复了它:

我这样做了,错误得到了修复,但现在我得到了一个新错误(在下一条评论中)。线程“Twitter4J异步调度程序[0]”java.lang.NoSuchMethodError中的异常:Twitter4J.json.JSONObjectType.Determinate(Ltwitter4j/internal/org/json/JSONObject;)Ltwitter4j/json/JSONObjectType;twitter4j.AbstractStreamImplementation$1.run(AbstractStreamImplementation.java:100)twitter4j.internal.async.ExecuteThread.run(DispatcherImpl.java:116)线程“Twitter Stream consumer-1[Receiving Stream]”java.lang.OutOfMemoryError:java堆空间在java.util.Arrays.copyOf(Arrays.java:3332)中出现异常在java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder)上,您现在似乎遇到了OOM。您可以尝试在
flume env.sh
中取消注释
export JAVA_OPTS
,并将JAVA max heap设置为更高的值,即将
-Xmx
设置为更高的值。谢谢!显然,经过一些研究,我发现我使用了一个糟糕的版本:flume-sources-1.0-SNAPSHOT.jar。通过以下方法生成我自己的jar修复:很高兴知道问题已经解决。我确实错过了上一条消息中的
NoSuchMethodError
日志。我这样做了,错误得到了修复,但现在我得到了一个新的错误(在下一条评论中)。线程“Twitter4J异步调度程序[0]”java.lang.NoSuchMethodError中的异常:Twitter4J.json.JSONObjectType.Determinate(Ltwitter4j/internal/org/json/JSONObject;)Ltwitter4j/json/JSONObjectType;twitter4j.AbstractStreamImplementation$1.run(AbstractStreamImplementation.java:100)twitter4j.internal.async.ExecuteThread.run(DispatcherImpl.java:116)线程“Twitter Stream consumer-1[Receiving Stream]”java.lang.OutOfMemoryError:java堆空间在java.util.Arrays.copyOf(Arrays.java:3332)中出现异常在java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder)上,您现在似乎遇到了OOM。您可以尝试在
flume env.sh
中取消注释
export JAVA_OPTS
,并将JAVA max heap设置为更高的值,即将
-Xmx
设置为更高的值。谢谢!显然,经过一些研究,我发现我使用了一个糟糕的版本:flume-sources-1.0-SNAPSHOT.jar。通过以下方法生成我自己的jar修复:很高兴知道问题已经解决。我肯定错过了最后一条消息中的
NoSuchMethodError
日志。