Video scene reconstruction pipeline: structure from motion, depth recovery, 3D reconstruction.

Structure from Motion. Guofeng Zhang, State Key Lab of CAD&CG, Zhejiang University.

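Pinhole camera model. Projection equation: $(X, Y, Z)^T \mapsto (fX/Z,\ fY/Z)^T$. Homogeneous-coordinate representation: $(X, Y, Z, 1)^T \mapsto (fX,\ fY,\ Z)^T$. Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Second Edition 2004.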

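Pinhole camera model in matrix form:
$$\begin{pmatrix} fX \\ fY \\ Z \end{pmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \mathrm{diag}(f, f, 1)\,[I \mid 0] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
With the calibration matrix $K$ and the camera pose $[R \mid t]$, the projection becomes $K[R \mid t]$. Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Second Edition 2004.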

Principal point offset:
$$\begin{pmatrix} fX/Z + x_0 \\ fY/Z + y_0 \\ 1 \end{pmatrix} \sim \begin{pmatrix} fX + Z x_0 \\ fY + Z y_0 \\ Z \end{pmatrix} = \begin{bmatrix} f & 0 & x_0 & 0 \\ 0 & f & y_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Second Edition 2004.

Camera extrinsic parameters. Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Second Edition 2004.

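Perspective camera model:
$$K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad P = K[R \mid t]$$
$P$ has 11 degrees of freedom (5 + 3 + 3): 5 for $K$, 3 for $R$ and 3 for $t$.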
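As an illustration of the projection $P = K[R \mid t]$, the following minimal sketch projects one 3D point to pixel coordinates. The intrinsic values (focal length 500, principal point (320, 240), zero skew) and the function name `project` are hypothetical and chosen only for the example.

```python
def project(K, R, t, X):
    """Project a 3D world point X to pixel coordinates with P = K [R | t]."""
    # Camera coordinates: Xc = R X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous image point: x ~ K Xc
    x = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    # Perspective division by the third homogeneous coordinate
    return x[0] / x[2], x[1] / x[2]


# Hypothetical intrinsics: f_x = f_y = 500, zero skew, principal point (320, 240)
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation
t = [0.0, 0.0, 0.0]                                      # zero translation

u, v = project(K, R, t, [0.2, -0.1, 2.0])
```

With an identity pose the result reduces to $u = f_x X/Z + c_x$ and $v = f_y Y/Z + c_y$.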

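Radial distortion, for example with fisheye lenses. Mathematical model:
$$R(x, y) = \left(1 + K_1 (x^2 + y^2) + K_2 (x^2 + y^2)^2 + \dots\right) \begin{pmatrix} x \\ y \end{pmatrix}$$
(Marc Pollefeys)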

Radial distortion correction example (Marc Pollefeys)
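A minimal sketch of a two-coefficient polynomial radial distortion model and its correction. It assumes normalized image coordinates, the form $(1 + k_1 r^2 + k_2 r^4)$ with $r^2 = x^2 + y^2$, and hypothetical coefficient values; the inverse is computed by fixed-point iteration, which is one common choice rather than the method used for the example images.

```python
def distort(x, y, k1, k2):
    """Apply radial distortion to normalized image coordinates (x, y)."""
    r2 = x * x + y * y
    s = 1.0 + k1 * r2 + k2 * r2 * r2
    return s * x, s * y


def undistort(xd, yd, k1, k2, iters=20):
    """Invert distort() by fixed-point iteration (assumes mild distortion)."""
    x, y = xd, yd  # initial guess: the distorted point itself
    for _ in range(iters):
        r2 = x * x + y * y
        s = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = xd / s, yd / s
    return x, y


# Hypothetical coefficients for a mild barrel distortion
k1, k2 = -0.2, 0.05
xd, yd = distort(0.3, 0.4, k1, k2)
xu, yu = undistort(xd, yd, k1, k2)
```

The iteration converges quickly when the distortion is small; strongly distorted lenses may need a different solver or model.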

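Multi-view geometry. Structure from motion: automatically recover the camera parameters and the 3D structure of the scene from multiple images or a video sequence. Noah Snavely, Steven M. Seitz, Richard Szeliski. "Photo tourism: Exploring photo collections in 3D". 2006.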

Two-view geometry 3D???

Two-view geometry 3D: epipolar geometry

Epipolar geometry

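Fundamental matrix. It depends only on the relative camera pose of the two views and on the intrinsics: $F = K_2^{-T} [t]_\times R K_1^{-1}$.
$F$ is a $3 \times 3$ matrix of rank 2, with $Fe = 0$ for the epipole $e$.
It has 7 degrees of freedom, so at least 7 point correspondences are needed to solve for $F$.
Algorithms: seven-point algorithm, eight-point algorithm.
OpenCV: cvFindFundamentalMat()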

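Eight-point algorithm for the fundamental matrix. By the epipolar constraint, the fundamental matrix $F$ satisfies $x'^T F x = 0$. Setting $x = (u, v, 1)^T$, $x' = (u', v', 1)^T$ and $f = (F_{11}, F_{12}, F_{13}, F_{21}, \dots, F_{33})^T$, the constraint can be rewritten as $(u'u,\ u'v,\ u',\ v'u,\ v'v,\ v',\ u,\ v,\ 1)\, f = 0$. Given $n$ correspondences, $F$ must satisfy the linear system $Af = 0$, where each correspondence contributes one row of the $n \times 9$ matrix $A$.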

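Eight-point algorithm for the fundamental matrix.
$f$ is a 9-vector; for a non-trivial solution to exist, $\mathrm{rank}(A)$ must be at most 8.
When $\mathrm{rank}(A) = 8$, the direction of $f$ is unique.
At least 8 correspondences give exactly such an $A$, for which $f$ is unique up to scale.
$f$ is the basis vector of the right null space of $A$ and can be obtained from $\mathrm{svd}(A)$.
Real data is noisy, so with more than 8 correspondences $A$ has full rank, i.e. $\mathrm{rank}(A) = 9$.
In that case, still compute $(U, \Sigma, V) = \mathrm{svd}(A)$ and take $f$ as the column of $V$ corresponding to the smallest singular value.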
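The SVD-based solution can be sketched with NumPy as follows. This is a minimal version under stated assumptions: it uses the convention $x'^T F x = 0$, it adds the usual rank-2 enforcement step, and it omits the coordinate normalization that is normally recommended for pixel coordinates. The synthetic two-view data (rotation angle, translation, point cloud) is hypothetical and only serves as a check.

```python
import numpy as np


def eight_point(x1, x2):
    """Estimate F with x2_h^T F x1_h = 0 from n >= 8 matches (n x 2 arrays)."""
    u, v = x1[:, 0], x1[:, 1]
    up, vp = x2[:, 0], x2[:, 1]
    # One row (u'u, u'v, u', v'u, v'v, v', u, v, 1) per correspondence
    A = np.column_stack([up * u, up * v, up, vp * u, vp * v, vp,
                         u, v, np.ones(len(u))])
    # f is the right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value of F
    U, S, Vt_f = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt_f


# Synthetic noise-free check with two normalized cameras (K = I)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3)) + np.array([0.0, 0.0, 5.0])
a = 0.1  # rotation about the y axis, in radians
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([1.0, 0.0, 0.2])
X2 = X @ R.T + t
x1 = X[:, :2] / X[:, 2:]
x2 = X2[:, :2] / X2[:, 2:]

F = eight_point(x1, x2)
x1_h = np.column_stack([x1, np.ones(len(x1))])
x2_h = np.column_stack([x2, np.ones(len(x2))])
residual = np.abs(np.einsum("ni,ij,nj->n", x2_h, F, x1_h))
```

On noise-free data the epipolar residuals are at numerical precision; with real matches the estimate is usually wrapped in RANSAC and refined.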

Multi-view geometry: projection function

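Structure-from-motion pipeline.
Feature tracking: obtain a set of feature tracks.
Structure from motion: solve for the camera parameters and the 3D positions of the feature tracks.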

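Image features.
Salient image content that is easy to distinguish and match: points (corners); lines (straight lines, curves); edges (2D edges, 3D edges); shapes (rectangles, circles, ellipses, spheres); texture.
Invariance: viewpoint (scale, orientation, translation); illumination; object deformation; partial occlusion.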

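Harris corner detection.
Core idea: analyze the distribution of image gradients.
Flat region: weak gradients.
Edge region: strong gradients with a consistent direction.
Corner region: strong gradients with inconsistent directions.
Method: compute the second-moment matrix of the gradients in each pixel's neighborhood; compute the corner response $R$ of that matrix; apply thresholding and non-maximum suppression to $R$.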
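The slide does not spell out the response formula. A common choice, assumed here, is $R = \det(M) - k\,\mathrm{tr}(M)^2$ for the gradient second-moment matrix $M$ with unweighted window sums and an empirical constant $k$ (0.04 is a typical value, also an assumption). The sketch below evaluates it on three hand-made gradient sets that stand in for a flat, an edge and a corner neighborhood.

```python
def harris_response(grads, k=0.04):
    """Corner response from (Ix, Iy) gradient pairs of one pixel neighborhood."""
    # Entries of the second-moment matrix M = [[a, b], [b, c]]
    a = sum(ix * ix for ix, _ in grads)
    b = sum(ix * iy for ix, iy in grads)
    c = sum(iy * iy for _, iy in grads)
    det = a * c - b * b
    trace = a + c
    return det - k * trace * trace


flat = [(0.0, 0.0)] * 9                          # no gradients
edge = [(1.0, 0.0)] * 9                          # one dominant direction
corner = [(1.0, 0.0)] * 5 + [(0.0, 1.0)] * 4     # two distinct directions

r_flat = harris_response(flat)
r_edge = harris_response(edge)
r_corner = harris_response(corner)
```

The sign pattern matches the three cases above: near zero for flat regions, negative for edges, positive for corners.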

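FAST. Speeds up corner extraction with direct threshold tests.
Consider the 16 pixels on a circle around the center pixel, whose intensity is $p$.
The center is a corner if $n$ contiguous circle pixels are all brighter than $p + t$ or all darker than $p - t$ (for example pixels 14 to 16 and 1 to 6 in the figure).
As a quick rejection test, first check positions 1, 5, 9 and 13: for a corner, at least three of the four must satisfy the condition above (this shortcut is valid for $n \ge 12$).
Fast, but not robust to noise.
Edward Rosten, Tom Drummond. Machine Learning for High-Speed Corner Detection. ECCV (1) 2006: 430-443.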
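A minimal sketch of the segment test on one candidate pixel. The ring is passed as a list of 16 intensities in circle order (index 0 is position 1); the function name, the default $n = 12$ and the choice to apply the four-point shortcut only when $n \ge 12$ are assumptions of this sketch.

```python
def is_fast_corner(p, ring, t, n=12):
    """Segment test: n contiguous ring pixels all > p + t or all < p - t."""
    if n >= 12:
        # Quick rejection on positions 1, 5, 9, 13 (indices 0, 4, 8, 12)
        quick = [ring[i] for i in (0, 4, 8, 12)]
        brighter = sum(v > p + t for v in quick)
        darker = sum(v < p - t for v in quick)
        if brighter < 3 and darker < 3:
            return False
    for sign in (1, -1):          # +1: brighter arc, -1: darker arc
        run = 0
        for v in ring + ring:     # doubled list handles wrap-around
            if sign * (v - p) > t:
                run += 1
                if run >= n:
                    return True
            else:
                run = 0
    return False


p, t = 100, 20
# 12 contiguous bright pixels: positions 14-16 and 1-9
arc12 = [150] * 9 + [100] * 4 + [150] * 3
# 9 contiguous bright pixels: positions 14-16 and 1-6
arc9 = [150] * 6 + [100] * 7 + [150] * 3
# Straight edge: only 8 contiguous bright pixels
edge8 = [150] * 8 + [100] * 8
```

Here `is_fast_corner(p, arc9, t, n=9)` accepts the nine-pixel arc, while the same ring is rejected for `n=12`.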

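SIFT (Scale-Invariant Feature Transform). SIFT determines the position of each feature and its scale by searching for maxima and minima in difference-of-Gaussian (DoG) images at different levels. Descriptor extraction is then performed on the Gaussian-smoothed image whose scale is closest to the feature's scale, which gives SIFT good scale invariance. David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2): 91-110 (2004).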

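Invariant features. After SIFT, a series of scale-invariant feature descriptor algorithms appeared, such as RIFT, GLOH and SURF. Among them, SURF comes close to SIFT in performance. SURF replaces the Gaussian kernels of SIFT with Haar wavelet convolutions accelerated by integral images, which makes it 3 to 7 times faster than SIFT. ORB is also widely used because of its good matching performance and very fast extraction.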

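Feature extraction. From higher accuracy to higher speed: SIFT, SURF, ORB.
SIFT: excellent scale invariance; adapts to viewpoint and brightness changes to some extent.
SURF: can handle severe image blur; faster than SIFT but less accurate.
ORB: extremely fast extraction; often used in place of SIFT in real-time applications.
All three feature extraction algorithms are implemented in OpenCV.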

Feature matching. Template matching: search directly for a given image patch in the target image.

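Feature matching. Under the small-motion assumption, KLT tracking can be used: brightness constancy between $I(x, y, t)$ and $I(x, y, t+1)$ gives $I_x u + I_y v + I_t = 0$ for the displacement $(u, v)$, which is one equation in two unknowns.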

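Feature matching. Further assumption: neighboring pixels move consistently, so the constraint for a single pixel is extended to all pixels in a neighborhood window.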
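Under the window assumption, the displacement can be obtained by least squares over the per-pixel constraints $I_x u + I_y v = -I_t$, which leads to a $2 \times 2$ linear system (the Lucas-Kanade step). The sketch below assumes the gradients are already computed and passed as hypothetical `(Ix, Iy, It)` triples; it is a single linearized step without image pyramids or iteration.

```python
def lucas_kanade_flow(grads):
    """Least-squares displacement (u, v) from (Ix, Iy, It) triples of a window."""
    # Normal equations: [[a, b], [b, c]] [u, v]^T = [d, e]^T
    a = sum(ix * ix for ix, _, _ in grads)
    b = sum(ix * iy for ix, iy, _ in grads)
    c = sum(iy * iy for _, iy, _ in grads)
    d = -sum(ix * it for ix, _, it in grads)
    e = -sum(iy * it for _, iy, it in grads)
    det = a * c - b * b
    if abs(det) < 1e-12:
        return None  # aperture problem: gradients share a single direction
    return (c * d - b * e) / det, (a * e - b * d) / det


# Synthetic window generated from a known displacement (0.5, -0.25)
true_u, true_v = 0.5, -0.25
spatial = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
window = [(ix, iy, -(ix * true_u + iy * true_v)) for ix, iy in spatial]

flow = lucas_kanade_flow(window)
degenerate = lucas_kanade_flow([(1.0, 0.0, -0.5)] * 4)
```

The degenerate case shows why a single pixel, or a window on a straight edge, cannot determine both unknowns.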

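Feature matching. For large motion, features are matched by comparing the distances between their descriptors.
Descriptor sizes: SIFT = 128 dimensions, SURF = 64 dimensions, ORB = 256 bits.
Matching strategies: brute-force matching, fast nearest-neighbor matching.
OpenCV provides the corresponding matching algorithms.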
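For binary descriptors such as ORB, the descriptor distance is typically the Hamming distance. The following minimal sketch performs brute-force matching on descriptors stored as Python integers; the function names, the toy descriptors and the `max_dist` threshold are hypothetical, and a practical system would use a library matcher plus a ratio or cross-check test.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors (ints)."""
    return bin(a ^ b).count("1")


def brute_force_match(query, train, max_dist=64):
    """For each query descriptor, return (query_idx, train_idx, distance)
    of its nearest train descriptor if the distance is within max_dist."""
    matches = []
    for qi, q in enumerate(query):
        ti, d = min(((ti, hamming(q, tr)) for ti, tr in enumerate(train)),
                    key=lambda m: m[1])
        if d <= max_dist:
            matches.append((qi, ti, d))
    return matches


# Toy 256-bit descriptors
train = [0, (1 << 256) - 1, int("10" * 128, 2)]
query = [train[2] ^ 0b111,  # train[2] with three bits flipped
         1]                 # one bit away from train[0]

matches = brute_force_match(query, train)
```

Brute force costs one distance per query-train pair, which is why approximate nearest-neighbor indexing is preferred for large descriptor sets.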

Loopback Sequences and Multiple Sequences How to efficiently match the common features among different subsequences?

Non-Consecutive Feature Tracking

Framework Overview
1. Detect SIFT features over the entire sequence.
2. Consecutive point tracking:
2.1 Match features between consecutive frames with descriptor comparison.
2.2 Perform the second-pass matching to extend track lifetime.
3. Non-consecutive track matching:
3.1 Use hierarchical k-means to cluster the constructed tracks.
3.2 Estimate the matching matrix with the grouped tracks.
3.3 Detect overlapping subsequences and join the matched tracks.

Two-Pass Matching for Consecutive Tracking. SIFT feature extraction; first-pass matching by descriptor comparison; global distinctive.

Two-View Geometry 3D???

Two-View Geometry 3D: Epipolar Geometry

Not enough! How to handle image distortion? Naive window-based matching becomes unreliable. How to give a good position initialization? Searching the whole epipolar line is still time-consuming and ambiguous, with many potential correspondences.

Second-Pass Matching by Planar Motion Segmentation. Estimate a set of homographies $H^1_{t,t+1}, H^2_{t,t+1}, H^3_{t,t+1}, H^4_{t,t+1}$ using the inlier matches from first-pass matching, and use them to align frame $t$ to frame $t+1$.

Second-Pass Matching by Planar Motion Segmentation. Guided matching: epipolar constraint, homography constraint.

Second-Pass Matching with Multi-Homographies. First-pass matching: 53 matches. Direct searching: 11 matches added. Our second-pass matching: 346 matches added.

Non-Consecutive Track Matching. Fast matching matrix estimation. Detect overlapping subsequences and join the matched tracks.

Fast Matching Matrix Estimation. Each track has a group of description vectors, summarized as a track descriptor. Use a hierarchical K-means approach to cluster the track descriptors.

Fast Matching Matrix Estimation

Non-Consecutive Track Matching: Simultaneously Match Images and Refine the Matching Matrix. Refine the matching matrix after matching the common features of the selected image pairs. Find the best matching images more reliably with the updated matching matrix.

Traditional SfM Framework
Feature tracking over the whole sequence.
Structure and motion initialization: compute F between two initial images; compute $P_1$ and $P_2$; triangulate the 3D points of the matched features.
For each additional view: compute the camera pose; refine and extend the 3D points.
Self-calibration: upgrade the projective reconstruction to a metric one.
Refine structure and motion: bundle adjustment.

Triangulation. Given $F$, compute $P$ and $P'$. Given $x$ and $x'$, compute $X$ such that $x = PX$ and $x' = P'X$. Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Second Edition 2004.

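Triangulation with noise. Because of noise, the rays back-projected into 3D space do not intersect exactly. Instead, minimize the distance from the projected points to the corresponding epipolar lines. Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Second Edition 2004.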

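Linear triangulation. Given the equations $x = PX$ and $x' = P'X$, where $p^{iT}$ denotes the $i$-th row of $P$, rewrite them as a matrix-vector product $AX = 0$ and solve directly in closed form. The minimized algebraic error has no geometric meaning, so the result is not optimal.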
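A minimal NumPy sketch of the linear method, assuming the standard construction in which each view contributes the rows $x\,p^{3T} - p^{1T}$ and $y\,p^{3T} - p^{2T}$ to $A$, and $X$ is taken as the right singular vector of the smallest singular value. The two projection matrices and the test point are hypothetical.

```python
import numpy as np


def triangulate_linear(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; x1, x2: (x, y) image points."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Homogeneous solution of A X = 0: last right singular vector
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]


# Two normalized cameras: the second one is shifted along the x axis
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])
x1 = (X_true[0] / X_true[2], X_true[1] / X_true[2])
x2 = ((X_true[0] - 1.0) / X_true[2], X_true[1] / X_true[2])

X_est = triangulate_linear(P1, P2, x1, x2)
```

With noisy observations this estimate is typically used only as the starting point for a geometric-error refinement.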

Minimizing the geometric error. Cost function: the reprojection error $d(x, PX)^2 + d(x', P'X)^2$. Solved with the Levenberg-Marquardt algorithm.

Knowing 3D Points, Compute Camera Motion. Compute the projection matrix. Decomposition of a metric projection matrix: $P = K[R \mid t] = [KR \mid Kt] = [M \mid Kt]$. Decompose $M$ into $K$ and $R$ by QR decomposition, then $t = K^{-1}(p_{14}, p_{24}, p_{34})^T$.

Bundle Adjustment. Definition: refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter (camera pose and/or calibration) estimates.
$$\arg\min_{C_1, \dots, C_{N_c},\, X_1, \dots, X_{N_p}} \sum_{i,j} \lVert \pi(X_i, C_j) - x_{ij} \rVert^2$$
B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon. Bundle adjustment - a modern synthesis. In Workshop on Vision Algorithms, pages 298-372, 1999.

Geometric Ambiguities. Projective reconstruction → self-calibration → metric reconstruction. Marc Pollefeys. Visual 3D Modeling from Images.

Self-Calibration: State-of-the-Art References
R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, second ed. Cambridge Univ. Press, 2004.
M. Pollefeys, L. J. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch, Visual Modeling with a Hand-Held Camera, Int'l J. Computer Vision, vol. 59, no. 3, pp. 207-232, 2004.
G. Zhang, X. Qin, W. Hua, T.-T. Wong, P.-A. Heng, and H. Bao, Robust Metric Reconstruction from Challenging Video Sequences, Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2007.

Recommended open-source SfM systems
ENFT-SFM or LS-ACTS: http://www.zjucvg.net/ls-acts/ls-acts.html
OpenMVG: https://github.com/openmvg/openmv
VisualSFM: http://ccwu.me/vsfm/

Visual SLAM. Guofeng Zhang, State Key Lab of CAD&CG, Zhejiang University.

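SLAM: Simultaneous Localization and Mapping.
A fundamental problem in robotics and computer vision: localize oneself in an unknown environment while simultaneously building a 3D map of that environment.
Wide range of applications: augmented reality, virtual reality, robotics, autonomous driving, aerospace.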

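Common SLAM sensors.
Infrared sensors: short-range sensing, commonly used in robot vacuum cleaners.
Lidar.
Depth sensors.
Cameras: monocular, stereo, multi-camera.
Inertial sensors (IMU, comprising gyroscope and accelerometer): standard on smartphones.
Pictured: lidar; a typical monocular camera (an ordinary phone camera can also serve as a sensor); a stereo camera; the Microsoft Kinect color-depth (RGB-D) sensor; the inertial sensor (IMU) in a phone.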

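SLAM output. Using the sensor data, the device computes its own pose (position and orientation in space) and builds a map of the environment (a sparse or dense 3D point cloud). Examples: sparse SLAM, dense SLAM.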

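Common framework of a SLAM system.
Input: sensor data (RGB images, depth maps, IMU measurements).
Foreground thread: tracks from the sensor data and recovers the pose at every time instant in real time.
Background thread: performs local or global optimization to reduce error accumulation, and detects loops in the scene.
Output: the real-time device pose and the 3D point cloud.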

Related Work
Filter-based SLAM: Davison et al. 2007 (MonoSLAM), Eade and Drummond 2006, Mourikis et al. 2007 (MSCKF).
Keyframe-based SLAM: Klein and Murray 2007, 2008 (PTAM), Castle et al. 2008, Tan et al. 2013 (RDSLAM), Mur-Artal et al. 2015 (ORB-SLAM), Liu et al. 2016 (RKSLAM).
Direct tracking based SLAM: Engel et al. 2014 (LSD-SLAM), Forster et al. 2014 (SVO), Engel et al. 2018 (DSO).

Extended Kalman Filter
State at time $k$, modeled as a multivariate Gaussian: $x_k \sim N(\hat{x}_k, P_k)$, with mean $\hat{x}_k$ and covariance $P_k$.
State transition model: $x_k = f(x_{k-1}) + w_k$, with process noise $w_k \sim N(0, Q_k)$.
State observation model: $z_k = h(x_k) + v_k$, with observation noise $v_k \sim N(0, R_k)$.

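Extended Kalman Filter
Predict:
$$\hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}), \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k, \qquad F_k = \left.\frac{\partial f}{\partial x}\right|_{\hat{x}_{k-1|k-1}}$$
Update:
$$S_k = H_k P_{k|k-1} H_k^T + R_k \quad \text{(innovation covariance)}, \qquad K_k = P_{k|k-1} H_k^T S_k^{-1}$$
$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left(z_k - h(\hat{x}_{k|k-1})\right), \qquad P_{k|k} = (I - K_k H_k) P_{k|k-1}, \qquad H_k = \left.\frac{\partial h}{\partial x}\right|_{\hat{x}_{k|k-1}}$$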
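The predict and update equations can be sketched for a scalar state, where every matrix reduces to a number. The function signature, the measurement model $h(x) = x^2$ and all numeric values below are hypothetical and serve only to exercise the formulas.

```python
def ekf_step(x, P, z, f, F, h, H, Q, R):
    """One scalar EKF step. f, h: models; F, H: their derivatives."""
    # Predict
    x_pred = f(x)
    Fk = F(x)                      # Jacobian of f at the previous estimate
    P_pred = Fk * P * Fk + Q
    # Update
    Hk = H(x_pred)                 # Jacobian of h at the predicted state
    S = Hk * P_pred * Hk + R       # innovation covariance
    K = P_pred * Hk / S            # Kalman gain
    x_new = x_pred + K * (z - h(x_pred))
    P_new = (1.0 - K * Hk) * P_pred
    return x_new, P_new


# Linear case: static state observed directly
x_lin, P_lin = ekf_step(0.0, 1.0, 2.0,
                        f=lambda x: x, F=lambda x: 1.0,
                        h=lambda x: x, H=lambda x: 1.0,
                        Q=0.0, R=1.0)

# Nonlinear case: static state observed through h(x) = x^2
x_nl, P_nl = ekf_step(1.0, 1.0, 1.44,
                      f=lambda x: x, F=lambda x: 1.0,
                      h=lambda x: x * x, H=lambda x: 2.0 * x,
                      Q=0.01, R=0.01)
```

In the linear case the step coincides with the ordinary Kalman filter: equal prior and measurement variances give a gain of 0.5.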

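MonoSLAM. A. J. Davison, N. D. Molton, I. Reid, and O. Stasse. MonoSLAM: Realtime single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(6):1052-1067, 2007.
Map representation: the state vector stacks the camera state and the point states, $x = (C, X_1, X_2, \dots)^T$, with covariance
$$P = \begin{bmatrix} P_{CC} & P_{CX_1} & P_{CX_2} & \cdots \\ P_{X_1C} & P_{X_1X_1} & P_{X_1X_2} & \cdots \\ P_{X_2C} & P_{X_2X_1} & P_{X_2X_2} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$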

MonoSLAM. Complexity: $O(N^3)$ per frame. Scalability: hundreds of points.

PTAM: Parallel Tracking and Mapping Map representation G. Klein and D. W. Murray. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR), 2007.

PTAM: Parallel Tracking and Mapping. Overview.
Foreground thread: feature extraction, feature tracking, camera pose estimation, new-keyframe decision.
Map: 3D points and keyframes.
Background thread (when a new keyframe is added): bundle adjustment, add new 3D points.

Keyframe-based SLAM vs Filtering-based SLAM
Advantages of the keyframe-based approach: accuracy, efficiency, scalability.
Disadvantages: sensitive to strong rotation.
Challenges for both: fast motion, motion blur, insufficient texture.
H. Strasdat, J. Montiel, and A. J. Davison. Visual SLAM: Why filter? Image and Vision Computing, 30:65-77, 2012.

ORB-SLAM: A Versatile and Accurate Monocular SLAM System Raul Mur-Artal, J. M. M. Montiel, Juan D. Tardós: ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robotics 31(5): 1147-1163 (2015).

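ORB-SLAM: A Versatile and Accurate Monocular SLAM System. It largely follows the algorithmic framework of PTAM but improves most of its components.
Uses ORB features, which give better matching and relocalization performance.
Adds loop detection and loop closure to eliminate accumulated error.
Automatically selects the two frames used for initialization by checking parallax.
Uses a more robust selection strategy for keyframes and 3D points.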

Direct Tracking. Thomas Schöps, Jakob Engel, Daniel Cremers: Semi-dense visual odometry for AR on a smartphone. ISMAR 2014: 145-150.

Direct Tracking
Goal: estimate the camera motion $\xi$ by aligning the intensity images $I_1$ and $I_2$, using the depth map $Z_1$ of $I_1$.
Assumption: $I_1(x) = I_2(w(\xi, x, Z_1(x)))$, where the warping function $w$ maps a pixel from $I_1$ to $I_2$.

Direct Tracking
Residual of the $k$-th pixel: $r_k(\xi) = I_2(w(\xi, x_k, Z_1(x_k))) - I_1(x_k)$.
Posterior likelihood: $p(\xi \mid r) = \dfrac{p(r \mid \xi)\, p(\xi)}{p(r)}$, with $p(r \mid \xi) = \prod_k p(r_k \mid \xi)$.

Semi-Dense Visual Odometry Jakob Engel, Jürgen Sturm, Daniel Cremers: Semi-dense Visual Odometry for a Monocular Camera. ICCV 2013: 1449-1456

Semi-Dense Visual Odometry. Keyframe representation: $K_i = (I_i, D_i, V_i)$, where $I_i(x)$ is the image intensity, $D_i(x) = d_i$ is the inverse depth, and $V_i(x) = \sigma^2_{d_i}$ is the inverse depth variance.

Semi-Dense Visual Odometry Overview

LSD-SLAM. Figure: after loop closure; before loop closure. Jakob Engel, Thomas Schöps, Daniel Cremers: LSD-SLAM: Large-Scale Direct Monocular SLAM. ECCV (2) 2014: 834-849.

LSD-SLAM. Map representation: pose graph of keyframes.
Node: keyframe $K_i = (I_i, D_i, V_i)$.
Edge: similarity transformation $\xi_{ji} \in \mathfrak{sim}(3)$.

LSD-SLAM Overview

LSD-SLAM. Direct sim(3) image alignment:
$$\xi_{ji}^* = \arg\min_{\xi_{ji}} \sum_{p} \left( \frac{r_p^2(p, \xi_{ji})}{\sigma^2_{r_p(p, \xi_{ji})}} + \frac{r_d^2(p, \xi_{ji})}{\sigma^2_{r_d(p, \xi_{ji})}} \right)$$
Here $r_p$ is the photometric residual between $I_i(p)$ and $I_j$ at the warped pixel, $r_d$ is the depth residual between the warped inverse depth and $D_j$, and the variances $\sigma^2$ are propagated from $V_i$ and $V_j$.

LSD-SLAM. Pose graph optimization. Energy function: Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g2o: A general framework for graph optimization. In: Intl. Conf. on Robotics and Automation (ICRA) (2011)

Key Issues for SLAM in Dynamic Environments: gradually changing scenes; object occlusion; viewpoint change; dynamic objects; very low inlier ratio.

RDSLAM Framework

Online 3D Points and Keyframes Updating
Keyframe representation.
3D change detection: select the 5 closest keyframes for the online image. For each valid feature point x in each selected keyframe, compute its projection x' in the current frame. If …, compute the appearance difference. If …, then find a set of feature points y close to x'.
Since dynamic points cannot be triangulated, the occlusion caused by dynamic objects can be excluded here.

Occlusion Handling

Random Sample Consensus (RANSAC) [Fischler and Bolles, 1981]
Objective: robust fit of a model to a data set S which contains outliers.
Step 1. Compute a set of potential matches.
Step 2. While T(#inliers, #samples) < 95% do:
  Step 2.1 Select a minimal sample (6 matches).
  Step 2.2 Compute solutions for P.
  Step 2.3 Determine inliers.
Step 3. Refine P based on all inliers.
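The stopping criterion can be made concrete under one common interpretation, assumed here: $T$ is the probability that at least one of the samples drawn so far contains only inliers, $T = 1 - (1 - w^s)^N$ with inlier ratio $w$, sample size $s$ and $N$ samples. The function names and the example numbers are hypothetical.

```python
import math


def success_probability(n_inliers, n_matches, n_samples, sample_size=6):
    """Probability that at least one of n_samples minimal samples is outlier-free."""
    w = n_inliers / n_matches
    return 1.0 - (1.0 - w ** sample_size) ** n_samples


def required_samples(inlier_ratio, sample_size=6, confidence=0.95):
    """Smallest number of samples for which the success probability reaches confidence."""
    if inlier_ratio <= 0.0:
        return float("inf")
    if inlier_ratio >= 1.0:
        return 1
    good = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - good))


n_half = required_samples(0.5)    # 50% inliers, 6-point samples
n_low = required_samples(0.2)     # 20% inliers
```

The required count grows steeply as the inlier ratio drops, which is the regime that motivates prior-guided sampling in dynamic scenes.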

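Prior-based Adaptive RANSAC
Sample generation: the image is divided into 10x10 bins, each with a prior probability $p_i$ used for sampling.
Hypothesis evaluation: the score $s$ combines the number of inliers $N_i$ with the inlier distribution, i.e. the distribution ellipse with covariance $C$, whose size enters through $\det(C)$ relative to $A$.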

Prior-based Adaptive RANSAC: hypothesis evaluation example. 200 green points lie on the static background, 300 cyan points lie on the rigidly moving object, and 500 red points move randomly. The two hypotheses shown, with values 24.94 and 21.77, receive scores $S_1 = 8.31 > S_2 = 1.98$.

Comparison

Results and Comparison

Visual-Inertial SLAM. Use IMU data to improve robustness.
Filtering-based methods: MSCKF, SLAM in Project Tango, ARCore, ARKit.
Non-linear optimization based methods: OKVIS, VINS.
Can it work without real IMU data?

RKSLAM Framework
Multi-homography based tracking: global homography, specific homography, local homographies.
Sliding-window based pose optimization: use global image alignment to estimate the rotational velocity; optimize the pose with simulated IMU data.

Sliding-Window based Pose Optimization. Assume IMU data is available. Set … and estimate … by …

Results and Comparisons

Quantitative Evaluation with TUM RGB-D Dataset. From left to right: RMSE (cm) of keyframes, the starting ratio (i.e. the initialization frame index divided by the total number of frames), and the tracking success ratio after initialization.
Group A: simple translation.
Group B: sequences with loops.
Group C: slow and nearly pure rotation.
Group D: fast motion with strong rotation.

Timing. Computation time on a desktop PC. On a mobile device: 20~50 fps on an iPhone 6.

Robust Keyframe-based Dense SLAM with an RGB-D Camera https://arxiv.org/abs/1711.05166

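RKD-SLAM system framework.
A very fast and robust RGB-D based tracking method (about 70-200 fps on a single CPU).
A very fast incremental bundle adjustment algorithm.
A very efficient keyframe-based depth representation and fusion method.
Supports fast motion, loop closure, relocalization and long-running operation.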

Efficient Incremental BA. Proposes a very efficient incremental Schur complement computation; solves the system with preconditioned conjugate gradient, which is faster than factorization-based methods; an order of magnitude faster than iSAM2.

Efficient Incremental BA. Comparison with iSAM2: running time and reprojection error.

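Keyframe-based Fusion. For each incoming frame $F_i$:
If it is a keyframe, integrate it into the TSDF.
If it is not a keyframe, select the keyframe $F_{k_i}$ with the largest overlap and de-integrate it from the TSDF, fuse the depth of $F_i$ into $F_{k_i}$, and then re-integrate the fused keyframe into the TSDF.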
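The slides do not give the voxel update rule. The sketch below assumes the widely used weighted running average for a single TSDF voxel, for which de-integration is the exact inverse of integration; the class name `TsdfVoxel` and the sample values are hypothetical and this is not claimed to be the system's actual implementation.

```python
class TsdfVoxel:
    """One TSDF voxel holding a weighted-average signed distance."""

    def __init__(self):
        self.d = 0.0   # fused signed distance
        self.w = 0.0   # accumulated weight

    def integrate(self, d, w=1.0):
        """Add one observation (signed distance d with weight w)."""
        self.d = (self.d * self.w + d * w) / (self.w + w)
        self.w += w

    def deintegrate(self, d, w=1.0):
        """Remove a previously integrated observation."""
        new_w = self.w - w
        if new_w <= 1e-12:
            self.d, self.w = 0.0, 0.0      # voxel becomes empty
        else:
            self.d = (self.d * self.w - d * w) / new_w
            self.w = new_w


voxel = TsdfVoxel()
voxel.integrate(0.1)        # observation from another keyframe
voxel.integrate(0.3)        # old depth of the selected keyframe
after_integrate = voxel.d

voxel.deintegrate(0.3)      # de-integrate the selected keyframe
after_deintegrate = voxel.d

voxel.integrate(0.5)        # re-integrate it with its fused depth
after_reintegrate = voxel.d
```

Re-integration is thus a de-integration with the old observation followed by an integration with the updated one.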

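Keyframe-based Fusion. When keyframe poses change (after EIBA optimization), re-integrate the keyframes whose poses changed according to the EIBA result. Maintain an update queue of keyframes and update those with the largest pose change first; only a fixed number of keyframes are re-integrated at each time step, and the remaining ones are updated at later time steps.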

Comparison of ATE RMSE on all of the sequences on TUM RGB-D Benchmark

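Robust handling of fast motion

Online loop closure and 3D surface adjustment

Comparison of monocular V-SLAM systems

Typical applications: 3D reconstruction, video segmentation and editing, augmented reality

3D reconstruction

Video segmentation and editing

Augmented reality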

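Visual SLAM technology trends (1)
Reducing feature dependence: edge-based tracking; direct image tracking or semi-dense tracking; combining machine learning with prior/semantic information.
Dense 3D reconstruction: real-time monocular/multi-camera 3D reconstruction; real-time 3D reconstruction with depth cameras; planar representation and adaptive model simplification.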

Visual SLAM technology trends (2). Multi-sensor fusion: combining IMU, GPS, depth cameras, optical flow sensors and odometers.

Our SLAM systems
RDSLAM: http://www.zjucvg.net/rdslam/rdslam.html
RKSLAM: http://www.zjucvg.net/rkslam/rkslam.html
More systems will be released in the future: http://www.zjucvg.net

Recommended open-source systems
PTAM: https://github.com/oxford-ptam/ptam-gpl
ORB-SLAM: https://github.com/raulmur/orb_slam
LSD-SLAM: https://github.com/tum-vision/lsd_slam
DSO: https://github.com/jakobengel/dso
SVO: https://github.com/uzh-rpg/rpg_svo

Open-source Solver & BA
g2o: https://github.com/rainerkuemmerle/g2o
GTSAM & iSAM: https://bitbucket.org/gtborg/gtsam/
Ceres Solver: http://ceres-solver.org/
Bundler: http://www.cs.cornell.edu/~snavely/bundler/
PBA: https://grail.cs.washington.edu/projects/mcba/
EIBA: the source code will be released soon. http://www.zjucvg.net

Thank you!

Video scene reconstruction pipeline: structure from motion, depth recovery, 3D reconstruction.
