docker - What causes Flume with a GCS sink to throw an OutOfMemoryException?


I am using Flume to write to Google Cloud Storage. Flume listens on HTTP port 9000. It took me some time to make it work (adding the GCS libraries, using a credentials file, ...), but it now seems to communicate over the network.

I am sending very small HTTP requests for my tests, and I have plenty of RAM available:

curl -X POST -d '[{ "headers" : { timestamp=1417444588182, env=dev, tenant=mytenant, type=mytype }, "body" : "some body one" }]' localhost:9000

I encounter a memory exception on the first request (after which, of course, it stops working):

2014-11-28 16:59:47,748 (hdfs-hdfs_sink-call-runner-0) [INFO - com.google.cloud.hadoop.util.LogUtil.info(LogUtil.java:142)] GHFS version: 1.3.0-hadoop2
2014-11-28 16:59:50,014 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:467)] process failed
java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:820)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)

(See the gist for the complete stack trace and full details.)

The strange part is that the folders and files are created exactly the way I want, but the files remain empty.

gs://my_bucket/dev/mytenant/mytype/2014-12-01/14-36-28.1417445234193.json.tmp 

Is something wrong with the way I configured Flume + GCS, or is it a bug in the GCS .jar?

Where should I check to gather more data?

PS: I am running flume-ng inside Docker.


My flume.conf file:

# Name the components on this agent
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem

# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000

# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = gs://my_bucket/%{env}/%{tenant}/%{type}/%Y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000
a1.channels.mem.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem

A related question from my Flume/GCS journey: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with Flume?

When uploading files, the GCS Hadoop FileSystem implementation sets aside a fairly large (64MB) write buffer per FSDataOutputStream (file open for write). This can be changed by setting "fs.gs.io.buffersize.write" to a smaller value, in bytes, in core-site.xml. I imagine 1MB would suffice for low-volume log collection.
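As a rough illustration, the override could look like the snippet below in the core-site.xml that the Flume agent's Hadoop classpath picks up; the 1048576 (1MB) value is just the example figure mentioned above, not a tuned recommendation.

<configuration>
  <!-- Shrink the per-stream GCS write buffer from the 64MB default to 1MB -->
  <property>
    <name>fs.gs.io.buffersize.write</name>
    <value>1048576</value>
  </property>
</configuration>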

In addition, check what maximum heap size is set when launching the JVM for Flume. The flume-ng script sets a default JAVA_OPTS value of -Xmx20m, which limits the heap to 20MB. You can set a larger value in flume-env.sh (see conf/flume-env.sh.template in the Flume tarball distribution for details).
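For example, a flume-env.sh entry along these lines raises the limit; the 512MB figure here is only an illustrative value, pick whatever fits your host.

# conf/flume-env.sh (copied from conf/flume-env.sh.template)
# Raise the agent heap above the 20MB default set by the flume-ng script.
export JAVA_OPTS="-Xmx512m"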

