Understanding Hadoop Streaming in Depth

牺米
5 years ago

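Hadoop Streaming is a MapReduce programming utility that ships with Hadoop. It lets users implement the Mapper and Reducer of a job with any executable, script, or other programming language. For example:

    -mapper /bin/cat \
    -reducer /usr/bin/wc

How Hadoop Streaming works

Hadoop Streaming uses Unix standard input and standard output as the interface between Hadoop and programs written in other languages. A program written in another language therefore only needs to read its input from standard input and write its output to standard output.

In the example above, both the mapper and the reducer are executables that read their input from stdin line by line and write their results to stdout. The utility creates a Map/Reduce job, submits it to the appropriate cluster, and monitors its progress until it completes.

When a file (an executable or a script) is specified as the mapper, each mapper task launches it as a separate process when the task is initialized. As the mapper task runs, it splits its input into lines and feeds each line to the standard input of that process. Meanwhile, the mapper collects the lines the process writes to standard output and converts each one into a key/value pair, which becomes the output of the mapper. By default, the part of a line up to the first tab character is the key, and the rest of the line (excluding the tab) is the value. If a line contains no tab, the whole line is the key and the value is null.

The reducer works in a similar way, so it is not described separately here.

This is the basic communication protocol between the Map/Reduce framework and a streaming mapper/reducer.

Users can set stream.non.zero.exit.is.failure to true or false to specify whether a streaming task that exits with a non-zero status counts as a failure or a success. By default, tasks that exit with a non-zero status are treated as failures.

Streaming Command Options

In addition to streaming command options, Hadoop Streaming supports Hadoop's generic command options, which are covered later in this article. The general command-line syntax is:

    mapred streaming [genericOptions] [streamingOptions]

Note that when a Streaming job uses generic command options, they must be placed before the streaming command options; otherwise the command will produce errors.

Hadoop Streaming (as of Hadoop 3.0.0) supports the following streaming command options:

    Parameter                                     Optional/Required  Description
    -input directoryname or filename              Required           Input location for the mapper
    -output directoryname                         Required           Output location for the reducer
    -mapper executable or JavaClassName           Optional           Mapper executable or Java class name
    -reducer executable or JavaClassName          Optional           Reducer executable or Java class name
    -file filename                                Optional           File that the mapper, reducer, or combiner depends on
    -inputformat JavaClassName                    Optional           Key/value input format; defaults to TextInputFormat
    -outputformat JavaClassName                   Optional           Key/value output format; defaults to TextOutputFormat
    -partitioner JavaClassName                    Optional           Class that determines which reduce a key is sent to
    -combiner streamingCommand or JavaClassName   Optional           Command or class name of the combiner run on the map output
    -cmdenv name=value                            Optional           Environment variable
    -inputreader                                  Optional           For backward compatibility: a record reader class, used instead of an input format class
    -verbose                                      Optional           Verbose log output
    -lazyOutput                                   Optional           Create output lazily
    -numReduceTasks                               Optional           Number of reduce tasks
    -mapdebug                                     Optional           Script to run when a map task fails
    -reducedebug                                  Optional           Script to run when a reduce task fails

Specifying a Java class as the Mapper/Reducer

A Java class can be specified as the mapper and/or the reducer, as follows:

    -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /usr/bin/wc

Packaging files with job submissions

As described above, any executable can be specified as the mapper or the reducer. These programs do not need to be deployed in advance on any machine in the Hadoop cluster. It is enough to pass the -file option when submitting the Streaming job, and Hadoop ships the files to the cluster automatically:

    -mapper mypythonscript.py \
    -reducer /usr/bin/wc \
    -file mypythonscript.py

In the command above, -file mypythonscript.py causes Hadoop to distribute the script to the cluster automatically. Besides executables, other files that the mapper or reducer uses (directories, configuration files, and so on) can be packaged the same way, for example: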

    -mapper mypythonscript.py \
    -reducer /usr/bin/wc \
    -file mypythonscript.py \
    -file mydictionary.txt

Specifying other plugins for a job

As with a normal Map/Reduce job, other plugins can be specified for a streaming job. The options are:

    -inputformat JavaClassName
    -outputformat JavaClassName
    -partitioner JavaClassName
    -combiner streamingCommand or JavaClassName

The class supplied for -inputformat must return key/value pairs of type Text. If no input format class is specified, TextInputFormat is used by default. Because TextInputFormat returns keys of type LongWritable, which are not part of the input data, the keys are discarded and only the values are passed on.

The class supplied for -outputformat must accept key/value pairs of type Text. If no output format class is specified, TextOutputFormat is used by default.

Setting environment variables

Environment variables can be set when submitting a Streaming job, as follows:

    -cmdenv EXAMPLE_DIR=/home/example/dictionaries/

Generic Command Options

The main generic command options available when submitting a streaming job are:

    Parameter                 Optional/Required  Description
    -conf configuration_file  Optional           Application configuration file
    -D property=value         Optional           Sets a value for the given property
    -fs host:port or local    Optional           NameNode address
    -files                    Optional           Files to copy to the Map/Reduce cluster, comma-separated
    -libjars                  Optional           Jar files to add to the classpath, comma-separated
    -archives                 Optional           Archives to unpack on the compute nodes, comma-separated

Specifying configuration variables with the -D option

Additional configuration variables can be set with -D <property>=<value>.

Specifying directories

To change the default local temp directory, use:

    -D dfs.data.dir=/tmp

To specify additional local temp directories, use:

    -D mapred.local.dir=/tmp/local
    -D mapred.system.dir=/tmp/system
    -D mapred.temp.dir=/tmp/temp

Specifying map-only jobs

Sometimes a job only needs a map phase. To achieve this, set mapreduce.job.reduces to 0. The Map/Reduce framework then starts no reduce tasks, and the output of the map tasks becomes the final output of the job:

    -D mapreduce.job.reduces=0
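To make the stdin/stdout protocol concrete, here is a minimal sketch of a script that could be passed to -mapper, for instance in a map-only job. Everything in it is an assumption made for illustration: the word-count style logic and the helper names map_line and split_key_value are hypothetical and not part of Hadoop. split_key_value only imitates the default rule described earlier (text before the first tab is the key); it is not the framework's own code.

```python
#!/usr/bin/env python3
"""Hypothetical streaming mapper: emits "<word><TAB>1" for every word on stdin."""
import sys


def map_line(line):
    """Turn one input line into a list of "key<TAB>value" output lines."""
    return [f"{word}\t1" for word in line.split()]


def split_key_value(line, separator="\t"):
    """Imitate the default split: text before the first tab is the key.

    If the line has no separator, the whole line is the key and the value
    is None (the framework's null value is modelled as None in this sketch).
    """
    key, sep, value = line.partition(separator)
    return (key, value) if sep else (line, None)


def main():
    # Hadoop Streaming feeds input records to stdin one line at a time and
    # collects whatever this process writes to stdout.
    for raw in sys.stdin:
        for out in map_line(raw.rstrip("\n")):
            print(out)


if __name__ == "__main__":
    main()
```

Because the script only touches stdin and stdout, it can be tried locally by piping a text file into it before it is ever submitted to a cluster.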

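For backward compatibility, Hadoop Streaming also supports the -reducer NONE option, which is equivalent to -D mapreduce.job.reduces=0.

Setting the number of reducers

The following example sets the number of reducers to 2:

    -D mapreduce.job.reduces=2 \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Customizing how lines are split into key/value pairs

As noted at the beginning of this article, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the part of the line before the first tab character is the key and the part after it is the value. The separator can be customized, as shown below:

    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -mapper /bin/cat \
    -reducer /bin/cat

In this example, stream.map.output.field.separator makes "." the separator between key and value, and stream.num.map.output.key.fields=4 makes the prefix up to the fourth "." the key, with the rest of the line as the value.

Working with large files and archives

The -files and -archives options make files and archives available to the tasks. The files or archives must already have been uploaded to HDFS. They are cached across jobs.

Making Files Available to Tasks

The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file fetched from HDFS. In the example below, the file testfile.txt on HDFS is specified; with the -files option, a symlink named testfile.txt is created in the current working directory of the tasks:

    -files hdfs://host:fs_port/user/testfile.txt

A different name for the symlink can be chosen with #:

    -files hdfs://host:fs_port/user/testfile.txt#testfile

Multiple files are specified like this:

    -files hdfs://host:fs_port/user/testfile1.txt,hdfs://host:fs_port/user/testfile2.txt

Making Archives Available to Tasks

The -archives option specifies archives (for example jar or tgz files) that are copied to the current working directory of the tasks and unpacked automatically. In the example below, the archive iteblog.jar on HDFS is specified; Hadoop creates a symlink named iteblog.jar in the current working directory of the tasks, pointing to the directory that holds the unpacked contents:

    -archives hdfs://host:fs_port/user/iteblog.jar

Here too, a different name for the symlink can be chosen:

    -archives hdfs://host:fs_port/user/iteblog.tgz#tgzdir

In the following example, input.txt contains only two lines, each naming a file: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. cachedir.jar is a symlink to the directory of the unpacked archive, which contains the two files cache.txt and cache2.txt.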

    -archives 'hdfs://iteblog.com/user/me/samples/cachefile/cachedir.jar' \
    -D mapreduce.job.maps=1 \
    -D mapreduce.job.reduces=1 \
    -D mapreduce.job.name="experiment" \
    -input "/user/me/samples/cachefile/input.txt" \
    -output "/user/me/samples/cachefile/out" \
    -mapper "xargs cat" \
    -reducer "cat"

    $ ls test_jar/
    cache.txt cache2.txt

    $ jar cvf cachedir.jar -C test_jar/ .
    added manifest
    adding: cache.txt(in = 30) (out= 29)(deflated 3%)
    adding: cache2.txt(in = 37) (out= 35)(deflated 5%)

    $ hdfs dfs -put cachedir.jar samples/cachefile

    $ hdfs dfs -cat /user/me/samples/cachefile/input.txt
    cachedir.jar/cache.txt
    cachedir.jar/cache2.txt

    $ cat test_jar/cache.txt
    This is just the cache string

    $ cat test_jar/cache2.txt
    This is just the second cache string

    $ hdfs dfs -ls /user/me/samples/cachefile/out
    Found 2 items
    -rw-r--r-* 1 me supergroup :00 /user/me/samples/cachefile/out/_SUCCESS
    -rw-r--r-* 1 me supergroup :00 /user/me/samples/cachefile/out/part

    $ hdfs dfs -cat /user/me/samples/cachefile/out/part
    This is just the cache string
    This is just the second cache string

More usage examples

Hadoop Partitioner Class

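Hadoop has a built-in class named KeyFieldBasedPartitioner that is useful in many applications. It partitions the map output on selected fields of the key rather than on the whole key. For example:

    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapreduce.partition.keypartitioner.options=-k1,2 \
    -D mapreduce.job.reduces=12 \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

The options have the following meaning:

    map.output.key.field.separator=.                   uses "." as the field separator inside the key when partitioning the map output
    mapreduce.partition.keypartitioner.options=-k1,2   partitions on the first two fields of the key
    mapreduce.job.reduces=12                           sets the number of reducers to 12

With these settings the map output keys are partitioned on their first two fields, so all keys that share the same first two fields end up in the same partition.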
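The effect of these options can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hadoop code: the sample keys and the helper names split_map_output, partition_fields and group_by_partition_fields are hypothetical. The sketch only shows which keys are guaranteed to travel together; the real partitioner additionally hashes the selected fields to pick one of the reducers, so two different groups may still share a reducer.

```python
from collections import defaultdict


def split_map_output(line, separator=".", num_key_fields=4):
    """Split a map output line the way the two stream.* options above describe:
    the first `num_key_fields` fields form the key, the remainder the value."""
    parts = line.split(separator)
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


def partition_fields(key, separator=".", first=1, last=2):
    """Return the fields first..last (1-based, inclusive), as selected by -k1,2."""
    return separator.join(key.split(separator)[first - 1:last])


def group_by_partition_fields(keys, separator="."):
    """Group full keys by their partition fields, sorting each group by full key."""
    groups = defaultdict(list)
    for key in keys:
        groups[partition_fields(key, separator)].append(key)
    return {prefix: sorted(members) for prefix, members in sorted(groups.items())}


# Hypothetical map output keys with four dot-separated fields.
sample_keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]

for prefix, members in group_by_partition_fields(sample_keys).items():
    print(prefix, members)
```

Keeping the key split and the partition-field selection in separate helpers mirrors the configuration, where stream.num.map.output.key.fields decides where the key ends and the -k option decides which part of that key drives partitioning.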

Within each partition, the records are then sorted on the whole key.

Hadoop Comparator Class

Hadoop has a class named KeyFieldBasedComparator that provides a subset of the features of the Unix/GNU sort utility. It is used as follows:

    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D mapreduce.map.output.key.field.separator=. \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -D mapreduce.job.reduces=1 \
    -mapper /bin/cat \
    -reducer /bin/cat

mapreduce.partition.keycomparator.options=-k2,2nr selects the second field of the key as the sort field; -n requests numeric sorting and -r reverses the order. With these options, the reduce output lists the keys in descending numeric order of their second field.
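The ordering requested by -k2,2nr can be approximated with an ordinary Python sort. This is only a sketch: the sample keys and the function name sort_by_second_field_desc are hypothetical, and the handling of keys with equal second fields (they keep their input order here, because Python's sort is stable) is a property of this sketch rather than a statement about the comparator.

```python
def sort_by_second_field_desc(keys, separator="."):
    """Order keys by their second field, compared as a number, largest first."""
    return sorted(keys, key=lambda k: int(k.split(separator)[1]), reverse=True)


# Hypothetical map output keys with four dot-separated fields.
sample_keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]

for key in sort_by_second_field_desc(sample_keys):
    print(key)
```

Converting the field with int() is what distinguishes numeric sorting from plain string comparison: as strings, "9" would sort after "14", while as numbers it sorts before.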

Hadoop Aggregate Package

Hadoop ships a library package called Aggregate. It provides a special reducer class and a special combiner class, together with aggregation functions over the reduce input such as sum, min and max. To use Aggregate, simply specify -reducer aggregate, as follows:

    -mapper myaggregatorforkeycount.py \
    -reducer aggregate \
    -file myaggregatorforkeycount.py

The file myaggregatorforkeycount.py looks roughly like this:

    #!/usr/bin/env python3
    import sys


    def generate_long_count_token(key):
        # "LongValueSum:" asks the aggregate reducer to sum the values per key.
        return "LongValueSum:" + key + "\t" + "1"


    def main(argv):
        # Emit one count record for the first tab-separated field of each line.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            print(generate_long_count_token(fields[0]))


    if __name__ == "__main__":
        main(sys.argv)

Hadoop Field Selection Class

Hadoop has a library class, FieldSelectionMapReduce, that lets you process text data much like the Unix cut utility. It is used as follows:

    -D mapreduce.map.output.key.field.separator=. \
    -D mapreduce.partition.keypartitioner.options=-k1,2 \
    -D mapreduce.fieldsel.data.field.separator=. \
    -D mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- \
    -D mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5- \
    -D mapreduce.map.output.key.class=org.apache.hadoop.io.Text \
    -D mapreduce.job.reduces=12 \
    -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- means that the key of the map output consists of fields 6, 5, 1, 2 and 3 of the split line, while the value consists of all fields.

mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5- means that the key of the reduce output consists of fields 0, 1 and 2, while the value consists of all fields from field 5 onward.

Unless otherwise stated, all articles on this blog are original. When reposting this article, please credit the source: reposted from 过往记忆.
