Hadoop Metadata Merge Exception and How to Fix It

Over the past few days I have been watching the logs on the Standby NameNode, and I noticed that every time an fsimage merge completes and the Standby NN notifies the Active NN to download the merged fsimage, the following exception appears:

```
2014-04-23 14:42:54,964 ERROR org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Exception in doCheckpoint
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.doGetUrl(TransferFsImage.java:268)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:247)
        at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:162)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:174)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1100(StandbyCheckpointer.java:53)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:297)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:210)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:230)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:226)
```
Out of habit I turned to Google, but after searching for a long time I found nothing similar, so I had to work it out myself. Analyzing the logs turned up something even stranger: the reported time of the last checkpoint never changed; it was always the time of the very first checkpoint taken when the Standby NN started, for example:

```
2014-04-23 14:50:54,429 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 70164 seconds since the last checkpoint, which exceeds the configured interval 600
```

Could this be a Hadoop bug? Guided by the messages above, I went to the source code and, after some careful reading, found that all of them are produced by the StandbyCheckpointer class:

```java
private void doWork() {
  // Reset checkpoint time so that we don't always checkpoint
  // on startup.
  lastCheckpointTime = now();
  while (shouldRun) {
    try {
      Thread.sleep(1000 * checkpointConf.getCheckPeriod());
    } catch (InterruptedException ie) {
    }
    if (!shouldRun) {
      break;
    }
    try {
      // We may have lost our ticket since last checkpoint, log in again,
      // just in case
      if (UserGroupInformation.isSecurityEnabled()) {
        UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
      }

      long now = now();
      long uncheckpointed = countUncheckpointedTxns();
      long secsSinceLast = (now - lastCheckpointTime) / 1000;

      boolean needCheckpoint = false;
      if (uncheckpointed >= checkpointConf.getTxnCount()) {
        LOG.info("Triggering checkpoint because there have been " +
            uncheckpointed + " txns since the last checkpoint, which " +
            "exceeds the configured threshold " +
            checkpointConf.getTxnCount());
        needCheckpoint = true;
      } else if (secsSinceLast >= checkpointConf.getPeriod()) {
        LOG.info("Triggering checkpoint because it has been " +
            secsSinceLast + " seconds since the last checkpoint, which " +
            "exceeds the configured interval " + checkpointConf.getPeriod());
        needCheckpoint = true;
      }

      synchronized (cancelLock) {
        if (now < preventCheckpointsUntil) {
          LOG.info("But skipping this checkpoint since we are about" +
              " to failover!");
          canceledCount++;
          continue;
        }
        assert canceler == null;
        canceler = new Canceler();
      }

      if (needCheckpoint) {
        doCheckpoint();
        lastCheckpointTime = now;
      }
    } catch (SaveNamespaceCancelledException ce) {
      LOG.info("Checkpoint was cancelled: " + ce.getMessage());
      canceledCount++;
    } catch (InterruptedException ie) {
      // Probably requested shutdown.
      continue;
    } catch (Throwable t) {
      LOG.error("Exception in doCheckpoint", t);
    } finally {
      synchronized (cancelLock) {
        canceler = null;
      }
    }
  }
}
```
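Note how doCheckpoint() and the lastCheckpointTime = now; update sit in the same try block. A stripped-down sketch of that control flow (my own minimal reproduction, not Hadoop code) shows why the logged elapsed time keeps growing when every checkpoint fails:

```java
import java.net.SocketTimeoutException;

// Minimal sketch (not Hadoop code): when doCheckpoint() always throws,
// lastCheckpointTime is never advanced, so the "seconds since the last
// checkpoint" reported in the log grows without bound.
public class CheckpointLoopSketch {
    static void doCheckpoint() throws SocketTimeoutException {
        // Simulate the failed image transfer seen in the stack trace.
        throw new SocketTimeoutException("Read timed out");
    }

    public static void main(String[] args) throws InterruptedException {
        long lastCheckpointTime = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(1000);
            long now = System.currentTimeMillis();
            long secsSinceLast = (now - lastCheckpointTime) / 1000;
            System.out.println("Triggering checkpoint; " + secsSinceLast
                    + " seconds since the last checkpoint");
            try {
                doCheckpoint();
                lastCheckpointTime = now; // never reached while doCheckpoint() throws
            } catch (SocketTimeoutException e) {
                System.out.println("Exception in doCheckpoint: " + e);
            }
        }
    }
}
```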
The exception at the top of this post is thrown while doCheckpoint() is executing, which means the statement lastCheckpointTime = now; is never reached — hence the frozen timestamp in the log. So why does doCheckpoint() fail in the first place? Following the stack trace leads to this statement in the doGetUrl method of the TransferFsImage class:

```java
if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
```

The connection times out before it can read the response code from the other side. My first suspicion was that my cluster's socket timeout settings were wrong, but after checking from every angle that turned out not to be the case. So I went back to the code and noticed the following setup just before that statement:

```java
if (timeout <= 0) {
  Configuration conf = new HdfsConfiguration();
  timeout = conf.getInt(DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_KEY,
      DFSConfigKeys.DFS_IMAGE_TRANSFER_TIMEOUT_DEFAULT);
}

if (timeout > 0) {
  connection.setConnectTimeout(timeout);
  connection.setReadTimeout(timeout);
}

if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
  throw new HttpGetFailedException(
      "Image transfer servlet at " + url +
      " failed with status code " + connection.getResponseCode() +
      "\nResponse message:\n" + connection.getResponseMessage(),
      connection);
}
```

DFS_IMAGE_TRANSFER_TIMEOUT_KEY is backed by the dfs.image.transfer.timeout parameter, whose default value is 10 * 60 * 1000 (in milliseconds). The description of this property reads:

"Timeout for image transfer in milliseconds. This timeout and the related dfs.image.transfer.bandwidthPerSec parameter should be configured such that normal image transfer can complete within the timeout. This timeout prevents client hangs when the sender fails during image transfer, which is particularly important during checkpointing. Note that this timeout applies to the entirety of image transfer, and is not a socket timeout."
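Before going further, it is easy to confirm which values a node is actually running with. Here is a small sketch of mine (not from Hadoop's codebase; it uses only the public Configuration API and assumes hdfs-site.xml is on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

// Sketch: print the effective image-transfer settings seen by this node.
// The fallback defaults (600000 ms; 0 = unthrottled) mirror the shipped defaults.
public class PrintImageTransferConf {
    public static void main(String[] args) {
        Configuration conf = new HdfsConfiguration(); // loads hdfs-site.xml if present
        System.out.println("dfs.image.transfer.timeout = "
                + conf.getInt("dfs.image.transfer.timeout", 10 * 60 * 1000) + " ms");
        System.out.println("dfs.image.transfer.bandwidthPerSec = "
                + conf.getLong("dfs.image.transfer.bandwidthPerSec", 0) + " bytes/s");
    }
}
```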
That is exactly where my problem was: this parameter is closely tied to dfs.image.transfer.bandwidthPerSec. The Active NN must be able to finish downloading the merged fsimage from the Standby NN within dfs.image.transfer.timeout, otherwise the exception above is thrown. I then looked at my own configuration:

```xml
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>60000</value>
</property>

<property>
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>1048576</value>
</property>
```

A 60-second timeout with the copy throttled to 1 MB per second, while the metadata on my cluster is more than 800 MB: it obviously cannot be copied within 60 seconds. After I raised dfs.image.transfer.timeout to a much larger value (an illustrative setting is sketched below), I kept watching the cluster: the exception above never appeared again, and some of the earlier exceptions were resolved by this change as well.
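As an illustration (the value below is just an example with enough headroom, not necessarily the one I used): at the configured 1 MB/s, an ~800 MB fsimage needs roughly 800 * 1024 * 1024 / 1048576 = 800 seconds, i.e. about 800,000 ms, so the timeout must comfortably exceed that — for instance 30 minutes:

```xml
<!-- Hypothetical fix: 30 minutes, a comfortable margin over the ~800,000 ms
     that an ~800 MB image needs at the configured 1 MB/s. -->
<property>
  <name>dfs.image.transfer.timeout</name>
  <value>1800000</value>
</property>
```

Raising dfs.image.transfer.bandwidthPerSec instead (or as well) would shorten the transfer itself; either way the two values must satisfy imageSize / bandwidthPerSec < timeout.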