I. Collecting logs to HDFS with Flume
1. Flume installation prerequisites
Install JDK 1.8: https://www.cnblogs.com/zeze/p/5902124.html
Verify: java -version
2. Flume installation
Download: wget http://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
Extract: tar -zxvf apache-flume-1.8.0-bin.tar.gz
Official documentation: http://flume.apache.org/FlumeUserGuide.html (strongly recommended; it is the most authoritative reference)
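Optionally, a minimal environment sketch so flume-ng can be run from any directory (the extraction path is an assumption based on the commands above; adjust to where you unpacked the archive):
# assumed extraction location
export FLUME_HOME=/home/apache-flume-1.8.0-bin
export PATH=$PATH:$FLUME_HOME/bin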
3. Flume configuration
Configure the JDK and JVM memory: cp conf/flume-env.sh.template conf/flume-env.sh
Start: nohup ./bin/flume-ng agent --conf conf/ --conf-file conf/flume-conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34343 -Dflume.root.logger=INFO,console >> ./logs/flume.log 2>&1 &
Verify: bin/flume-ng version
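A minimal sketch of conf/flume-env.sh after copying the template (the JDK path and heap sizes are assumptions; adjust them to your machine):
# point Flume at the JDK 1.8 installation
export JAVA_HOME=/usr/local/jdk1.8.0
# heap for the agent JVM; size according to channel capacity and event volume
export JAVA_OPTS="-Xms512m -Xmx1024m"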
4. Flume log collection: a single log file
Requirement: collect the log test_201812177.log produced on server X1 into the CDH big-data cluster.
- 4.1 Architecture
- 4.2 Configuration on the log machine
# example.conf: A single-node Flume configuration
# The agent is named "a1"
a1.sources = r1
a1.sinks = k1-1 k1-2 k1-3
a1.channels = c1
# *** log collection ***
# source configuration
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/apache-flume-1.8.0-bin/logs/test.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/logs/test_.*log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.fileHeader = true
# sink group
a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1-1 k1-2 k1-3
a1.sinkgroups.g1.processor.type=failover
a1.sinkgroups.g1.processor.priority.k1-1=10
a1.sinkgroups.g1.processor.priority.k1-2=5
a1.sinkgroups.g1.processor.priority.k1-3=1
a1.sinkgroups.g1.processor.maxpenalty=10000
# sink configuration (failover; priorities from high to low)
a1.sinks.k1-1.type = avro
a1.sinks.k1-1.hostname = 192.168.0.1
a1.sinks.k1-1.port = 41401
a1.sinks.k1-2.type = avro
a1.sinks.k1-2.hostname = 192.168.0.2
a1.sinks.k1-2.port = 41401
a1.sinks.k1-3.type = avro
a1.sinks.k1-3.hostname = 192.168.0.3
a1.sinks.k1-3.port = 41401
# channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind the source and sinks to this channel
a1.sources.r1.channels = c1
a1.sinks.k1-1.channel = c1
a1.sinks.k1-2.channel = c1
a1.sinks.k1-3.channel = c1
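To bring up this agent on the log machine, a sketch of the start command and a quick health check (the file name logmachine.conf is an assumption; the metrics endpoint comes from the -Dflume.monitoring.port=34343 option shown in section 3):
nohup bin/flume-ng agent --conf conf/ --conf-file conf/logmachine.conf --name a1 -Dflume.monitoring.type=http -Dflume.monitoring.port=34343 >> logs/flume.log 2>&1 &
# the built-in HTTP monitoring server returns JSON counters for the source, channel, and sinks
curl http://localhost:34343/metrics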
- 4.3 Flume configuration on CDH
# The agent is named "a1"
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# *** verification-code sending log collection ***
# source configuration
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41401
# sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.path = /user/hive/warehouse/logs/test/ymd=%Y%m%d
a1.sinks.k1.hdfs.filePrefix = log
a1.sinks.k1.hdfs.inUseSuffix=.txt
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.idleTimeout = 3600
a1.sinks.k1.hdfs.batchSize=10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
# channel configuration
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind the source and sink to this channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
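With both agents running, a quick way to confirm that data is landing in HDFS (the partition value 20181217 is a placeholder for the current date):
# list the partition directory written by the HDFS sink
hdfs dfs -ls /user/hive/warehouse/logs/test/ymd=20181217
# sample the beginning of the rolled files
hdfs dfs -cat '/user/hive/warehouse/logs/test/ymd=20181217/log.*' | head -n 20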
5. Common Flume problems
5.1 org/apache/flume/node/Application : Unsupported major.minor version 52.0
Cause: Java version mismatch. Class file version 52.0 corresponds to Java 8, so the JVM running Flume is older than Java 8. Flume 1.8 works fine with JDK 1.8 in testing.
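A quick check that the JVM Flume will actually use is Java 8 or newer (a minimal sketch; $JAVA_HOME is whatever you configured in conf/flume-env.sh):
java -version
$JAVA_HOME/bin/java -version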
5.2 Collection into HDFS produced large numbers of small files; the oversized NameNode metadata occasionally brought the NameNode down.
Configuration that produced the small files:
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollSize = 10240 # set this to 0 to fix it
a1.sinks.k1.hdfs.rollInterval = 600 # set this to 0 to fix it
a1.sinks.k1.hdfs.minBlockReplicas = 1
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
Explanation:
rollSize (default 1024): when the temporary file reaches this size in bytes, it is rolled into a final file; 0 disables size-based rolling.
rollCount (default 10): when this many events have been written, the temporary file is rolled into a final file; 0 disables count-based rolling.
rollInterval (default 30): roll the current file after this many seconds; 0 disables time-based rolling.
round (default false): whether to "round down" the event timestamp; if enabled, it affects every time escape sequence except %t.
roundValue (default 1): the value the timestamp is rounded down to.
roundUnit (default second): the unit used for rounding; one of second, minute, hour.
When round, roundValue, and roundUnit are set, the HDFS path given to the sink must contain time escape sequences so that directories are generated by time. For example, to create one directory per hour of collection, holding that hour's data, the settings look like the sketch below.
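A minimal sketch of the hourly-directory variant of the HDFS sink from section 4.3 (the path is an assumption; only the %H escape and the rounding lines change):
# one directory per hour; rounding truncates the timestamp to the hour
a1.sinks.k1.hdfs.path = /user/hive/warehouse/logs/test/ymd=%Y%m%d/hour=%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour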
References:
Linux JDK installation: https://www.cnblogs.com/zeze/p/5902124.html
Resolving large numbers of small files in HDFS: https://blog.csdn.net/whdxjbw/article/details/80606917
Getting started with Flume 1.7.0: installation, deployment, and examples: https://www.cnblogs.com/netbloomy/p/6666683.html