Hadoop's MapReduce API provides automatic parallelization and job distribution, fault tolerance, status monitoring tools, and a clean abstraction for programmers to use

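Hadoop Programming, Part 5: Developing Hadoop Map/Reduce Programs

The programmer only needs to solve the actual problem; the architectural concerns are left to MapReduce.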

Hadoop's MapReduce API provides:

* automatic parallelization and job distribution
* fault tolerance
* status monitoring tools
* a clean abstraction for programmers to use

HDFS & MapReduce

[Figure: HDFS and MapReduce, with the labels "HDFS", "Input", and "Output". Part of the image is from http://www.larsgeorge.com/2009/05/hbase-mapreduce-101-part-i.html]

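<Key, Value> Pair

[Figure: the row data is the Map input. Map outputs <key, val> pairs (key1, key2, key1, ...). The pairs are then grouped by key ("Select Key"), so the Reduce input is one key together with all of its values (key1: val, val, val). Reduce outputs <key, values>.]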
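The flow in the figure can be sketched in plain Java, with no Hadoop dependency. This is only an illustration under stated assumptions: the class name `KeyValueFlowSketch`, the "key value" row format, and the choice to join values with commas are all hypothetical and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the Map -> group by key -> Reduce flow.
class KeyValueFlowSketch {
    // Map step: split one input row "key value" into a <key, value> pair.
    static String[] map(String row) {
        return row.split(" ", 2);
    }

    // Grouping step: collect all values that share a key (keys end up sorted).
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce step: collapse the values of one key into a single output value.
    static String reduce(String key, List<String> values) {
        return String.join(",", values);
    }

    static Map<String, String> run(List<String> rows) {
        List<String[]> pairs = new ArrayList<>();
        for (String row : rows) {
            pairs.add(map(row));
        }
        Map<String, String> output = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : groupByKey(pairs).entrySet()) {
            output.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {key1=a,c, key2=b}
        System.out.println(run(Arrays.asList("key1 a", "key2 b", "key1 c")));
    }
}
```

In a real job the grouping step is done by the framework between the Map and Reduce phases; the programmer writes only the two functions.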

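Program Prototype (v 0.20)

A MapReduce program has three regions: the Map region, the Reduce region, and the configuration region. In this skeleton, Mapper, Reducer, and ThisMainClass are placeholders for your own classes.

    class MR {
        // Map region
        static public class Mapper {
            // Map code
        }

        // Reduce region
        static public class Reducer {
            // Reduce code
        }

        // Configuration region
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "job name");
            job.setJarByClass(ThisMainClass.class);
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // other configuration code
            job.waitForCompletion(true);
        }
    }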

Class Mapper (v 0.20)

INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, and OUTPUT_VALUE_CLASS are placeholders for the actual key and value classes.

    import org.apache.hadoop.mapreduce.Mapper;

    class MyMap extends Mapper<INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, OUTPUT_VALUE_CLASS> {
        // global variables

        public void map(INPUT_KEY_CLASS key, INPUT_VALUE_CLASS value, Context context)
                throws IOException, InterruptedException {
            // local variables and program logic
            context.write(newKey, newValue);
        }
    }

Class Reducer (v 0.20)

INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, and OUTPUT_VALUE_CLASS are placeholders for the actual key and value classes.

    import org.apache.hadoop.mapreduce.Reducer;

    class MyRed extends Reducer<INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, OUTPUT_VALUE_CLASS> {
        // global variables

        public void reduce(INPUT_KEY_CLASS key, Iterable<INPUT_VALUE_CLASS> values, Context context)
                throws IOException, InterruptedException {
            // local variables and program logic
            context.write(newKey, newValue);
        }
    }

Other commonly used settings

Set the combiner:

    job.setCombinerClass(...);

Set the output classes:

    job.setMapOutputKeyClass(...);
    job.setMapOutputValueClass(...);
    job.setOutputKeyClass(...);
    job.setOutputValueClass(...);

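Class Combiner

* A combiner aggregates the intermediate output, which helps reduce the amount of data transferred from the Mapper to the Reducer.
* Setting one is optional; if none is set, Hadoop falls back to its default behavior.
* You also do not have to implement a separate class; the Reducer can be reused as the combiner.
* It is set with JobConf.setCombinerClass(Class) in the 0.18-style API, or with Job.setCombinerClass(Class) in the 0.20-style API.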
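To illustrate why a combiner reduces the Mapper-to-Reducer traffic, here is a small plain-Java sketch. It does not use Hadoop; the class name `CombinerSketch`, its methods, and the sample words are assumptions made up for this illustration. Reusing the reduce logic as a combiner is safe here because summing is associative and commutative, so partial sums can be merged in any order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical imitation of word counting with a combiner on each map task.
class CombinerSketch {
    // Combiner: local aggregation of one map task's output (word -> partial count).
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : mapOutputWords) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reducer: merge the partial counts coming from every map task.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    // Number of <key, value> records left to transfer after combining.
    static int recordsShipped(List<Map<String, Integer>> partials) {
        int n = 0;
        for (Map<String, Integer> partial : partials) {
            n += partial.size();
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> task1 = Arrays.asList("news", "news", "good");
        List<String> task2 = Arrays.asList("news", "no");
        List<Map<String, Integer>> partials = Arrays.asList(combine(task1), combine(task2));
        // prints: without combiner: 5 records
        System.out.println("without combiner: " + (task1.size() + task2.size()) + " records");
        // prints: with combiner: 4 records
        System.out.println("with combiner: " + recordsShipped(partials) + " records");
        // prints {good=1, news=3, no=1}
        System.out.println(reduce(partials));
    }
}
```

The saving grows with the number of repeated keys per map task. A reducer whose output depends on seeing all values at once should not be reused as a combiner without checking that partial application gives the same final result.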

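Example 1 (1) - mapper

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HelloHadoop {

        // Identity mapper: passes each <offset, line> pair through unchanged
        static public class HelloMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        } // HelloMapper end

        // (to be continued)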

Example 1 (2) - reducer

        static public class HelloReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
            public void reduce(LongWritable key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                Text val = new Text();
                // Keep the last value seen for this key
                for (Text str : values) {
                    val.set(str.toString());
                }
                context.write(key, val);
            }
        } // HelloReducer end

        // (to be continued)

Example 1 (3) - main

        public static void main(String[] args)
                throws IOException, InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "Hadoop Hello World");
            job.setJarByClass(HelloHadoop.class);
            FileInputFormat.setInputPaths(job, "input");
            FileOutputFormat.setOutputPath(job, new Path("output-hh1"));
            job.setMapperClass(HelloMapper.class);
            job.setReducerClass(HelloReducer.class);
            job.waitForCompletion(true);
        } // main end
    } // HelloHadoop class end

Hadoop Programming, Part 7: Hadoop Program Examples

7.1: HDFS operations
7.2: MapReduce computation

Uploading a file to HDFS

    // Uploads a file from the local file system to HDFS.
    // src is the local source; dst is the HDFS destination.
    public class PutToHdfs {
        static boolean putToHdfs(String src, String dst, Configuration conf) {
            Path dstPath = new Path(dst);
            try {
                // Get the FileSystem object used to operate on HDFS
                FileSystem hdfs = dstPath.getFileSystem(conf);
                // Upload (false: keep the local source file)
                hdfs.copyFromLocalFile(false, new Path(src), new Path(dst));
            } catch (IOException e) {
                e.printStackTrace();
                return false;
            }
            return true;
        }
    }

Retrieving a file from HDFS

    // Downloads a file from HDFS to the local file system.
    // src is the HDFS source; dst is the local destination.
    public class GetFromHdfs {
        static boolean getFromHdfs(String src, String dst, Configuration conf) {
            Path srcPath = new Path(src);
            try {
                // Get the FileSystem object used to operate on HDFS
                FileSystem hdfs = srcPath.getFileSystem(conf);
                // Download (false: keep the source file on HDFS)
                hdfs.copyToLocalFile(false, new Path(src), new Path(dst));
            } catch (IOException e) {
                e.printStackTrace();
                return false;
            }
            return true;
        }
    }

Checking for and deleting files

    // checkAndDelete: checks whether the given folder exists and deletes it if it does.
    public class CheckAndDelete {
        static boolean checkAndDelete(final String path, Configuration conf) {
            Path dstPath = new Path(path);
            try {
                // Get the FileSystem object used to operate on HDFS
                FileSystem hdfs = dstPath.getFileSystem(conf);
                // Check whether the path exists
                if (hdfs.exists(dstPath)) {
                    // If so, delete it recursively
                    hdfs.delete(dstPath, true);
                }
            } catch (IOException e) {
                e.printStackTrace();
                return false;
            }
            return true;
        }
    }

Hadoop Programming, Part 7: Hadoop Program Examples

7.2: MapReduce computation

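Example 2 (1): HelloHadoopV2

Description: compared with HelloHadoop, this program adds the following:
* It checks whether the output folder exists and, if so, deletes it.
* If the input folder contains more than two files, the data is not overwritten.
* The map and reduce classes are split out so that they can be reused.

How to test: run this program on a Hadoop 0.20 platform with:
---------------------------
hadoop jar V2.jar HelloHadoopV2
---------------------------

Notes:
1. The source path on HDFS is "/user/$your_name/input". The data must be placed in this HDFS folder beforehand, and the folder may contain only files, not subfolders.
2. When the job finishes, the results are written to the HDFS output path "/user/$your_name/output-hh2".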

Example 2 (2)

    public class HelloHadoopV2 {
        public static void main(String[] args)
                throws IOException, InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "Hadoop Hello World 2");
            job.setJarByClass(HelloHadoopV2.class);
            // Set the map, reduce, and combiner classes
            job.setMapperClass(HelloMapperV2.class);
            job.setCombinerClass(HelloReducerV2.class);
            job.setReducerClass(HelloReducerV2.class);
            // Set the map output types
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            // Set the reduce output types
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("input"));
            FileOutputFormat.setOutputPath(job, new Path("output-hh2"));
            // Call checkAndDelete to check whether the output folder
            // exists and delete it if it does
            CheckAndDelete.checkAndDelete("output-hh2", conf);
            boolean status = job.waitForCompletion(true);
            if (status) {
                System.err.println("Integrate Alert Job Finished!");
            } else {
                System.err.println("Integrate Alert Job Failed!");
                System.exit(1);
            }
        }
    }

Example 2 (3)

    public class HelloMapperV2 extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(key.toString()), value);
        }
    }

    public class HelloReducerV2 extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String str = new String("");
            Text final_key = new Text();
            Text final_value = new Text();
            // Join the values that share the same key, separated by "&&"
            for (Text tmp : values) {
                str += tmp.toString() + " &&";
            }
            final_key.set(key);
            final_value.set(str);
            context.write(final_key, final_value);
        }
    }

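Example 3 (1): HelloHadoopV3

Description: this program reuses the map and reduce classes of HelloHadoopV2. It automatically uploads the input to HDFS, runs the job, and retrieves the results. It also prints a usage message, takes its paths as arguments, and prints the elapsed time.

How to test: run this program on a Hadoop 0.20 platform with:
---------------------------
hadoop jar V3.jar HelloHadoopV3 <local_input> <local_output>
---------------------------

Notes:
1. The first argument is the local input folder. Make sure it contains data and has no subdirectories.
2. The second argument is the local results folder. The program creates it, so it does not need to exist beforehand; if it already exists, delete it first.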

Example 3 (2)

    public class HelloHadoopV3 {
        public static void main(String[] args)
                throws IOException, InterruptedException, ClassNotFoundException {
            String hdfs_input = "HH3_input";
            String hdfs_output = "HH3_output";
            Configuration conf = new Configuration();
            // Parse the command line and get the remaining arguments
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            // Print a usage message unless exactly 2 arguments are given
            if (otherArgs.length != 2) {
                System.err.println("Usage: hadoop jar HelloHadoopV3.jar <local_input> <local_output>");
                System.exit(2);
            }
            Job job = new Job(conf, "Hadoop Hello World");
            job.setJarByClass(HelloHadoopV3.class);
            // Reuse the map and reduce classes of the previous example
            job.setMapperClass(HelloMapperV2.class);
            job.setCombinerClass(HelloReducerV2.class);
            job.setReducerClass(HelloReducerV2.class);
            // Set the key and value output types of map and reduce
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // (to be continued)

Example 3 (3)

            // Use checkAndDelete to avoid errors caused by paths that already exist
            CheckAndDelete.checkAndDelete(hdfs_input, conf);
            CheckAndDelete.checkAndDelete(hdfs_output, conf);
            // Upload the input files to HDFS
            PutToHdfs.putToHdfs(args[0], hdfs_input, conf);
            // Set the HDFS input and output paths
            FileInputFormat.addInputPath(job, new Path(hdfs_input));
            FileOutputFormat.setOutputPath(job, new Path(hdfs_output));
            long start = System.nanoTime();
            // Run the job once and keep its status (the slide called waitForCompletion twice)
            boolean status = job.waitForCompletion(true);
            // Download the results from HDFS
            GetFromHdfs.getFromHdfs(hdfs_output, args[1], conf);
            // Compute the elapsed time
            if (status) {
                System.err.println("Integrate Alert Job Finished!");
                long time = System.nanoTime() - start;
                System.err.println(time * (1E-9) + " secs.");
            } else {
                System.err.println("Integrate Alert Job Failed!");
                System.exit(1);
            }
        }
    }

Example 4 (1)

    // main method of the WordCount class
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: hadoop jar WordCount.jar <input> <output>");
            System.exit(2);
        }
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        CheckAndDelete.checkAndDelete(args[1], conf);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

Example 4 (2)

    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            // Emit <word, 1> for every token in the line
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Trace shown on the slide, for the input file /user/hadooper/input/a.txt containing the line "No news is a good news.": the input value is read into line (shown on the slide as "no news is a good news"), itr steps through the tokens one at a time, and each token is written out as <word, one>: < no, 1 > < news, 1 > < is, 1 > < a, 1 > < good, 1 > < news, 1 >.

Example 4 (3)

    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // Conceptually the same as the index loop on the slide (pseudocode only,
            // since an Iterable has no length and cannot be indexed):
            //   for (int i = 0; i < values.length; i++) { sum += values[i].get(); }
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

Trace shown on the slide: the reducer receives the grouped <word, one> pairs < a, 1 > < good, 1 > < is, 1 > < news, 1 1 > < no, 1 >. For the key news the values are 1 and 1, so the <key, sumValue> output is < news, 2 >.
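The trace can be reproduced without Hadoop by the following plain-Java sketch (the class name `WordCountTraceSketch` is hypothetical). It also makes one detail visible: `StringTokenizer` splits only on whitespace, so the raw line "No news is a good news." keeps its capital letter and its trailing period, and the slide's result of < news, 2 > assumes the line has already been lower-cased and stripped of punctuation.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of TokenizerMapper plus IntSumReducer.
class WordCountTraceSketch {
    static Map<String, Integer> count(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // "map" emits <word, 1>; merge() plays the role of the summing reducer
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {No=1, a=1, good=1, is=1, news=1, news.=1}
        System.out.println(count("No news is a good news."));
        // prints {a=1, good=1, is=1, news=2, no=1}
        System.out.println(count("no news is a good news"));
    }
}
```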

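Example 5 (1): WordCountV2

Description: counts words, with added features such as ignoring case and filtering out symbols.

How to test: run this program on a Hadoop 0.20 platform with:
---------------------------
hadoop jar WCV2.jar WordCountV2 -Dwordcount.case.sensitive=false \
  <input> <output> -skip patterns/patterns.txt
---------------------------

Notes:
1. The source path on HDFS is the <input> you specify. The data must be placed in this HDFS folder beforehand, and the folder may contain only files, not subfolders.
2. When the job finishes, the results are written to the HDFS output path you specify as <output>.
3. Create a folder named patterns and put a file patterns.txt in it (matching the -skip path in the command above). The file holds one pattern per line, each prefixed with the escape character \, for example:
\.
\,
\!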

Example 5 (2)

    public class WordCountV2 extends Configured implements Tool {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {

            static enum Counters { INPUT_WORDS }

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            private boolean caseSensitive = true;
            private Set<String> patternsToSkip = new HashSet<String>();
            private long numRecords = 0;
            private String inputFile;

            // Read the job settings and, if requested, load the skip patterns
            public void configure(JobConf job) {
                caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
                inputFile = job.get("map.input.file");
                if (job.getBoolean("wordcount.skip.patterns", false)) {
                    Path[] patternsFiles = new Path[0];
                    try {
                        patternsFiles = DistributedCache.getLocalCacheFiles(job);
                    } catch (IOException ioe) {
                        System.err.println("Caught exception while getting cached files: "
                                + StringUtils.stringifyException(ioe));
                    }
                    for (Path patternsFile : patternsFiles) {
                        parseSkipFile(patternsFile);
                    }
                }
            }

            // Each line of the patterns file is one regular expression to remove
            private void parseSkipFile(Path patternsFile) {
                try {
                    BufferedReader fis = new BufferedReader(
                            new FileReader(patternsFile.toString()));
                    String pattern = null;
                    while ((pattern = fis.readLine()) != null) {
                        patternsToSkip.add(pattern);
                    }
                } catch (IOException ioe) {
                    System.err.println("Caught exception while parsing the cached file '"
                            + patternsFile + "' : " + StringUtils.stringifyException(ioe));
                }
            }

            public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
                for (String pattern : patternsToSkip) {
                    line = line.replaceAll(pattern, "");
                }
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                    reporter.incrCounter(Counters.INPUT_WORDS, 1);
                }
                // (to be continued)

Example 5 (3)

                // Report progress every 100 records
                if ((++numRecords % 100) == 0) {
                    reporter.setStatus("Finished processing " + numRecords + " records "
                            + "from the input file: " + inputFile);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), WordCountV2.class);
            conf.setJobName("wordcount");
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length < 2) {
                System.out.println("WordCountV2 [-Dwordcount.case.sensitive=<false|true>] \\ ");
                System.out.println(" <indir> <outdir> [-skip Pattern_file]");
                return 0;
            }
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            // Separate the -skip option from the input and output paths
            List<String> other_args = new ArrayList<String>();
            for (int i = 0; i < args.length; ++i) {
                if ("-skip".equals(args[i])) {
                    DistributedCache.addCacheFile(new Path(args[++i]).toUri(), conf);
                    conf.setBoolean("wordcount.skip.patterns", true);
                } else {
                    other_args.add(args[i]);
                }
            }
            FileInputFormat.setInputPaths(conf, new Path(other_args.get(0)));
            FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));
            CheckAndDelete.checkAndDelete(other_args.get(1), conf);
            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new WordCountV2(), args);
            System.exit(res);
        }
    }
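The text-normalization part of the mapper (case folding followed by removal of the skip patterns) can be tried out in isolation. The following is a minimal plain-Java sketch under stated assumptions: the class name `SkipPatternSketch` and the in-memory pattern list are hypothetical stand-ins for the mapper fields and the patterns file, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone version of the case folding and pattern skipping.
class SkipPatternSketch {
    static Map<String, Integer> count(String line, boolean caseSensitive,
                                      List<String> patternsToSkip) {
        String text = caseSensitive ? line : line.toLowerCase();
        for (String pattern : patternsToSkip) {
            // Each pattern is a regular expression, hence the escaped "\\."
            text = text.replaceAll(pattern, "");
        }
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same three patterns as the example patterns file: \. \, \!
        List<String> patterns = Arrays.asList("\\.", "\\,", "\\!");
        // prints {a=1, good=1, is=1, news=2, no=1}
        System.out.println(count("No news is a good news.", false, patterns));
    }
}
```

Because the patterns are regular expressions, an unescaped `.` in the patterns file would match every character and erase the whole line, which is why each entry carries the leading backslash.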

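Example 6 (1): WordIndex

Description: prints, for every word, the file it comes from and the line in which it appears.

How to test: run this program on a Hadoop 0.20 platform with:
---------------------------
hadoop jar WI.jar WordIndex <input> <output>
---------------------------

Notes:
1. The source path on HDFS is the <input> you specify. The data must be placed in this HDFS folder beforehand, and the folder may contain only files, not subfolders.
2. When the job finishes, the results are written to the HDFS output path you specify as <output>.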

Example 6 (2)

    public class WordIndex {

        public static class wordindexM extends Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                FileSplit fileSplit = (FileSplit) context.getInputSplit();
                Text map_key = new Text();
                Text map_value = new Text();
                String line = value.toString();
                StringTokenizer st = new StringTokenizer(line.toLowerCase());
                // Emit <word, "fileName:line"> for every word in the line
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    map_key.set(word);
                    map_value.set(fileSplit.getPath().getName() + ":" + line);
                    context.write(map_key, map_value);
                }
            }
        }

        static public class wordindexR extends Reducer<Text, Text, Text, Text> {
            // The slide declared reduce() with the 0.18-style OutputCollector and
            // Reporter parameters, which does not override the 0.20 Reducer method;
            // the Context-based signature is used here instead.
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                StringBuilder ret = new StringBuilder("\n");
                for (Text val : values) {
                    // Taken per value; the slide accumulated with +=, which
                    // repeated the earlier entries on every appended line
                    String v = val.toString().trim();
                    if (v.length() > 0) {
                        ret.append(v + "\n");
                    }
                }
                context.write(key, new Text(ret.toString()));
            }
        }

        // (to be continued)

Example 6 (3)

        public static void main(String[] args)
                throws IOException, InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length < 2) {
                System.out.println("hadoop jar WordIndex.jar <indir> <outdir>");
                return;
            }
            Job job = new Job(conf, "word index");
            job.setJobName("word inverted index");
            job.setJarByClass(WordIndex.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(wordindexM.class);
            job.setReducerClass(wordindexR.class);
            job.setCombinerClass(wordindexR.class);
            FileInputFormat.setInputPaths(job, args[0]);
            CheckAndDelete.checkAndDelete(args[1], conf);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            long start = System.nanoTime();
            job.waitForCompletion(true);
            long time = System.nanoTime() - start;
            System.err.println(time * (1E-9) + " secs.");
        }
    }
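The idea behind WordIndex, an inverted index from each word to the "file:line" entries that contain it, can be sketched in plain Java as follows. This is an illustration only: the class name `WordIndexSketch`, the in-memory map of file names to lines, and the sample file contents are assumptions, not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory inverted index: word -> list of "fileName:line".
class WordIndexSketch {
    static Map<String, List<String>> index(Map<String, List<String>> files) {
        Map<String, List<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                // Same tokenization as the mapper: lower-case, split on whitespace
                StringTokenizer st = new StringTokenizer(line.toLowerCase());
                while (st.hasMoreTokens()) {
                    inverted.computeIfAbsent(st.nextToken(), k -> new ArrayList<>())
                            .add(file.getKey() + ":" + line);
                }
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new TreeMap<>();
        files.put("a.txt", Arrays.asList("Hello Hadoop"));
        files.put("b.txt", Arrays.asList("hello world"));
        // prints {hadoop=[a.txt:Hello Hadoop], hello=[a.txt:Hello Hadoop, b.txt:hello world], world=[b.txt:hello world]}
        System.out.println(index(files));
    }
}
```

As in the mapper, the key is lower-cased while the stored line keeps its original case, and a word that occurs twice in one line produces two entries for that line.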

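Example 7 (1): YourMenu

Description: integrates the functions of the previous examples.

How to test: run this program on a Hadoop 0.20 platform with:
---------------------------
hadoop jar YourMenu.jar <function>
---------------------------

Notes:
1. This program must be packaged into a single jar file together with all of the previous examples.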

Example 7 (2)

    public class YourMenu {
        public static void main(String argv[]) {
            int exitCode = -1;
            ProgramDriver pgd = new ProgramDriver();
            if (argv.length < 1) {
                // No function given: print the menu
                System.out.print("******************************************\n"
                        + " Welcome to the NCHC computing examples \n"
                        + " Command: \n"
                        + " hadoop jar NCHC-example-*.jar <function> \n"
                        + " Functions: \n"
                        + " HelloHadoop: shows what Hadoop's <Key,Value> pairs look like \n"
                        + " HelloHadoopV2: shows Hadoop's <Key,Value> pairs, advanced version \n"
                        + " HelloHadoopV3: shows Hadoop's <Key,Value> pairs, evolved version \n"
                        + " WordCount: counts the words of each file in the input folder \n"
                        + " WordCountV2: advanced version of WordCount \n"
                        + " WordIndex: indexes each word with every line in which it appears \n"
                        + "******************************************\n");
            } else {
                try {
                    // Register each example under the name listed in the menu
                    pgd.addClass("HelloHadoop", HelloHadoop.class, " Hadoop hello world");
                    pgd.addClass("HelloHadoopV2", HelloHadoopV2.class, " Hadoop hello world V2");
                    pgd.addClass("HelloHadoopV3", HelloHadoopV3.class, " Hadoop hello world V3");
                    pgd.addClass("WordCount", WordCount.class, " word count.");
                    pgd.addClass("WordCountV2", WordCountV2.class, " word count V2.");
                    pgd.addClass("WordIndex", WordIndex.class, "invert each word in line");
                    pgd.driver(argv);
                    // Success
                    exitCode = 0;
                    System.exit(exitCode);
                } catch (Throwable e) {
                    e.printStackTrace();
                }
            }
        }
    }

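Supplement: Program Prototype (v 0.18)

As in the v 0.20 prototype, the program has a Map region, a Reduce region, and a configuration region; Mapper and Reducer are placeholders for your own classes.

    class MR {
        // Map region
        class Mapper {
            // Map code
        }

        // Reduce region
        class Reducer {
            // Reduce code
        }

        // Configuration region
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MR.class);
            conf.setMapperClass(Mapper.class);
            conf.setReducerClass(Reducer.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // other configuration code
            JobClient.runJob(conf);
        }
    }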

Supplement: Class Mapper (v 0.18)

INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, and OUTPUT_VALUE_CLASS are placeholders for the actual key and value classes.

    import org.apache.hadoop.mapred.*;

    class MyMap extends MapReduceBase
            implements Mapper<INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, OUTPUT_VALUE_CLASS> {
        // global variables

        public void map(INPUT_KEY_CLASS key, INPUT_VALUE_CLASS value,
                OutputCollector<OUTPUT_KEY_CLASS, OUTPUT_VALUE_CLASS> output,
                Reporter reporter) throws IOException {
            // local variables and program logic
            output.collect(newKey, newValue);
        }
    }

Supplement: Class Reducer (v 0.18)

INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, and OUTPUT_VALUE_CLASS are placeholders for the actual key and value classes.

    import org.apache.hadoop.mapred.*;

    class MyRed extends MapReduceBase
            implements Reducer<INPUT_KEY_CLASS, INPUT_VALUE_CLASS, OUTPUT_KEY_CLASS, OUTPUT_VALUE_CLASS> {
        // global variables

        public void reduce(INPUT_KEY_CLASS key, Iterator<INPUT_VALUE_CLASS> values,
                OutputCollector<OUTPUT_KEY_CLASS, OUTPUT_VALUE_CLASS> output,
                Reporter reporter) throws IOException {
            // local variables and program logic
            output.collect(newKey, newValue);
        }
    }

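Conclusions

* The example programs above cover:
  * Hadoop's <key, value> structure
  * operating the HDFS file system
  * the MapReduce computation model
* When running a Hadoop job, the program file does not need to be uploaded to Hadoop, but the data must be in HDFS.
* The program in Example 7 can be used to run computations one after another.
* Some APIs differ slightly between Hadoop 0.20 and Hadoop 0.18, so if a Hadoop program found online fails to compile, try swapping in the corresponding functions.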