Flink on YARN Deployment Quick-Start Guide

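Apache Flink is an efficient, distributed, general-purpose big data analytics engine implemented in Java and Scala (mostly Java). It combines the efficiency, flexibility, and scalability of distributed MapReduce-style platforms with the query optimization techniques of parallel databases. It supports both batch and stream-based data analysis and provides Java and Scala APIs.

According to the official Flink documentation, Flink currently supports three main deployment modes: Local, Cluster, and Cloud, as shown in the figure below:

[Figure: Flink deployment modes]

This article briefly describes how to deploy Apache Flink on YARN (that is, how to run Flink jobs on YARN). It is based on Apache Flink 1.0.0 and Hadoop 2.2.0.

There are two main ways to start Flink on YARN: (1) start a YARN session (start a long-running Flink cluster on YARN); (2) submit and run a Flink job directly on YARN (run a Flink job on YARN). Each is described below.

Flink YARN Session

In this mode a YARN session is started, which brings up Flink's two required services: the JobManager and the TaskManagers. You can then submit jobs to the cluster, and multiple Flink jobs can be submitted within the same session. Note that this mode requires Hadoop 2.2 or later, and HDFS must be installed (because starting a YARN session uploads the required jar and configuration files to HDFS).

We can start a YARN session with the ./bin/yarn-session.sh script. Since this is our first time using the script, let's look at the options it supports:

    [iteblog@www.iteblog.com flink]$ ./bin/yarn-session.sh
    Usage:
       Required
         -n,--container <arg>            Number of YARN container to allocate (=Number of Task Managers)
       Optional
         -D <arg>                        Dynamic properties
         -d,--detached                   Start detached
         -jm,--jobManagerMemory <arg>    Memory for JobManager Container [in MB]
         -nm,--name <arg>                Set a custom name for the application on YARN
         -q,--query                      Display available YARN resources (memory, cores)
         -qu,--queue <arg>               Specify YARN queue.
         -s,--slots <arg>                Number of slots per TaskManager
         -st,--streaming                 Start Flink in streaming mode
         -tm,--taskManagerMemory <arg>   Memory per TaskManager Container [in MB]

The help text already explains each option in detail. At startup you can specify the number of TaskManagers and their memory (the default is 1 GB), as well as the JobManager memory, but there can only be one JobManager. Now let's start a YARN session:

    ./bin/yarn-session.sh -n 4 -tm 8192 -s 8

The command above starts 4 TaskManagers, each with 8 GB of memory. The -s option sets the number of slots per TaskManager, so each TaskManager uses 8 cores (the default is 1).

When a YARN session starts, it loads the conf/flink-conf.yaml configuration file. Adjust the parameters in it as needed (see the official Flink documentation for what each parameter means). If everything goes well, a page similar to the following is available at https://www.iteblog.com:9981/proxy/application_1453101066555_2766724/#/overview:

[Figure: screenshot of the overview page]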
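To make the sizing of the session command concrete, here is a minimal sketch of the arithmetic behind `-n 4 -tm 8192 -s 8`. The function name `session_totals` is a hypothetical helper for illustration only; it is not part of Flink, and it does not count the JobManager container.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flink): estimate what a YARN session requests.
# Arguments mirror the yarn-session.sh options:
#   $1 = number of TaskManagers        (-n)
#   $2 = memory per TaskManager in MB  (-tm)
#   $3 = slots per TaskManager         (-s)
session_totals() {
    n=$1
    tm_mb=$2
    slots=$3
    # Totals are simple products; the JobManager container is not included.
    echo "taskmanagers=$n total_tm_memory_mb=$((n * tm_mb)) total_slots=$((n * slots))"
}

# The values used in this article: 4 TaskManagers, 8192 MB each, 8 slots each.
session_totals 4 8192 8
```

For the values used here, the script prints `taskmanagers=4 total_tm_memory_mb=32768 total_slots=32`, which is worth comparing against the resources reported by `./bin/yarn-session.sh -q` before starting the session.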

Once the YARN session is running, how do we run a job? It's simple: we submit jobs with the ./bin/flink script. Again, let's look at the options this script supports:

    [iteblog@www.iteblog.com flink-1.0.0]$ bin/flink
    ./flink <ACTION> [OPTIONS] [ARGUMENTS]

    The following actions are available:

    Action "run" compiles and runs a program.

      Syntax: run [OPTIONS] <jar-file> <arguments>
      "run" action options:
         -c,--class <classname>               Class with the program entry point ("main"
                                              method or "getPlan()" method. Only needed
                                              if the JAR file does not specify the class
                                              in its manifest.
         -C,--classpath <url>                 Adds a URL to each user code classloader on
                                              all nodes in the cluster. The paths must
                                              specify a protocol (e.g. file://) and be
                                              accessible on all nodes (e.g. by means of a
                                              NFS share). You can use this option multiple
                                              times for specifying more than one URL. The
                                              protocol must be supported by the {@link
                                              java.net.URLClassLoader}.
         -d,--detached                        If present, runs the job in detached mode
         -m,--jobmanager <host:port>          Address of the JobManager (master) to which
                                              to connect. Specify 'yarn-cluster' as the
                                              JobManager to deploy a YARN cluster for the
                                              job. Use this flag to connect to a different
                                              JobManager than the one specified in the
                                              configuration.
         -p,--parallelism <parallelism>       The parallelism with which to run the
                                              program. Optional flag to override the
                                              default value specified in the configuration.
         -q,--sysoutLogging                   If present, supress logging output to
                                              standard out.
         -s,--fromSavepoint <savepointPath>   Path to a savepoint to reset the job back to
                                              (for example file:///flink/savepoint-1537).

We use the run action to run a Flink job. The script automatically picks up the address of the YARN session, so the --jobmanager option can be omitted. Taking the WordCount program bundled with Flink as an example, first upload a test file to HDFS:

    hadoop fs -copyFromLocal LICENSE hdfs:///user/iteblog/

Then run the WordCount program with this file as input:

    ./bin/flink run ./examples/batch/WordCount.jar --input hdfs:///user/iteblog/LICENSE

If all goes well, the results are printed to the terminal:

    (0,9)
    (1,6)
    (10,3)
    (12,1)
    (15,1)
    (17,1)
    (2,9)

    (2004,1)
    (2010,2)
    (2011,2)
    (2012,5)
    (2013,4)
    (2014,6)
    (2015,7)
    (2016,2)
    (3,6)
    (4,4)
    (5,3)
    (50,1)
    (6,3)
    (7,3)
    (8,2)
    (9,2)
    (a,25)
    (above,4)
    (acceptance,1)
    (accepting,3)
    (act,1)

If we would rather save the results to a file than print them to the terminal, the --output option specifies where to store them:

    ./bin/flink run ./examples/batch/WordCount.jar \
        --input hdfs:///user/iteblog/LICENSE \
        --output hdfs:///user/iteblog/result.txt

We can then inspect the results of the run in hdfs:///user/iteblog/result.txt. Two things to note:

1. The --input and --output options above are not built into Flink; they are defined by the WordCount program.
2. Always include the scheme when specifying a path, such as hdfs:// above; otherwise the program looks for the file on the local filesystem.

Run a single Flink job on YARN

The YARN session above starts a Flink cluster in a Hadoop YARN environment, and its resources can be shared with other Flink jobs. We can also launch a single Flink job on YARN directly. We still use ./bin/flink, but no YARN session needs to be started beforehand:

    ./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar \
        --input hdfs:///user/iteblog/LICENSE \
        --output hdfs:///user/iteblog/result.txt

The command above also brings up a page like the one started by the YARN session. The -yn option sets the number of TaskManagers and is required.

Unless otherwise stated, all articles on this blog are original. When reposting, please include: reposted from 过往记忆 (https://www.iteblog.com/).
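The note about always giving paths a scheme can be checked before a job is submitted. The following is a minimal sketch, assuming a loose notion of "has a scheme" (a letter, then any characters, then `://`). The function name `has_scheme` is hypothetical and not part of Flink or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flink or Hadoop): succeed only if the path
# starts with something that looks like a URI scheme, e.g. hdfs:// or file://.
# This is a loose pattern check, not a full URI validation.
has_scheme() {
    case "$1" in
        [a-zA-Z]*://*) return 0 ;;
        *) return 1 ;;
    esac
}

# Compare a path with a scheme against a bare path that would be resolved locally.
for p in hdfs:///user/iteblog/LICENSE /user/iteblog/LICENSE; do
    if has_scheme "$p"; then
        echo "ok: $p"
    else
        echo "missing scheme: $p"
    fi
done
```

Run as written, this reports the hdfs:/// path as ok and flags /user/iteblog/LICENSE as missing a scheme, which is the case where WordCount would look for the file on the local filesystem.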