pISSN 1897-8649 (PRINT) / eISSN 2080-2145 (ONLINE)
VOLUME 7
N° 1
2013
www.jamris.org
JOURNAL of AUTOMATION, MOBILE ROBOTICS & INTELLIGENT SYSTEMS
Editor-in-Chief:
Janusz Kacprzyk (Systems Research Institute, Polish Academy of Sciences; PIAP, Poland)

Co-Editors:
Oscar Castillo (Tijuana Institute of Technology, Mexico)
Dimitar Filev (Research & Advanced Engineering, Ford Motor Company, USA)
Kaoru Hirota (Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Japan)
Witold Pedrycz (ECERF, University of Alberta, Canada)
Roman Szewczyk (PIAP, Warsaw University of Technology, Poland)

Executive Editor: Anna Ładan, aladan@piap.pl

Associate Editors:
Jacek Salach (Warsaw University of Technology, Poland)
Maciej Trojnacki (Warsaw University of Technology and PIAP, Poland)

Statistical Editor: Małgorzata Kaliczyńska (PIAP, Poland)

Editorial Office:
Industrial Research Institute for Automation and Measurements PIAP
Al. Jerozolimskie 202, 02-486 Warsaw, POLAND
Tel. +48-22-8740109, office@jamris.org

Copyright and reprint permissions: Executive Editor
Editorial Board: Chairman: Janusz Kacprzyk (Polish Academy of Sciences; PIAP, Poland) Mariusz Andrzejczak (BUMAR, Poland) Plamen Angelov (Lancaster University, UK) Zenn Bien (Korea Advanced Institute of Science and Technology, Korea) Adam Borkowski (Polish Academy of Sciences, Poland) Wolfgang Borutzky (Fachhochschule Bonn-Rhein-Sieg, Germany) Chin Chen Chang (Feng Chia University, Taiwan) Jorge Manuel Miranda Dias (University of Coimbra, Portugal) Bogdan Gabryś (Bournemouth University, UK) Jan Jabłkowski (PIAP, Poland) Stanisław Kaczanowski (PIAP, Poland) Tadeusz Kaczorek (Warsaw University of Technology, Poland) Marian P. Kaźmierkowski (Warsaw University of Technology, Poland) Józef Korbicz (University of Zielona Góra, Poland) Krzysztof Kozłowski (Poznań University of Technology, Poland) Eckart Kramer (Fachhochschule Eberswalde, Germany) Piotr Kulczycki (Cracow University of Technology, Poland) Andrew Kusiak (University of Iowa, USA) Mark Last (Ben–Gurion University of the Negev, Israel) Anthony Maciejewski (Colorado State University, USA) Krzysztof Malinowski (Warsaw University of Technology, Poland) Andrzej Masłowski (Warsaw University of Technology, Poland)
Patricia Melin (Tijuana Institute of Technology, Mexico) Tadeusz Missala (PIAP, Poland) Fazel Naghdy (University of Wollongong, Australia) Zbigniew Nahorski (Polish Academy of Science, Poland) Antoni Niederliński (Silesian University of Technology, Poland) Witold Pedrycz (University of Alberta, Canada) Duc Truong Pham (Cardiff University, UK) Lech Polkowski (Polish-Japanese Institute of Information Technology, Poland) Alain Pruski (University of Metz, France) Leszek Rutkowski (Częstochowa University of Technology, Poland) Klaus Schilling (Julius-Maximilians-University Würzburg, Germany) Ryszard Tadeusiewicz (AGH University of Science and Technology in Cracow, Poland)
Stanisław Tarasiewicz (University of Laval, Canada) Piotr Tatjewski (Warsaw University of Technology, Poland) Władysław Torbicz (Polish Academy of Sciences, Poland) Leszek Trybus (Rzeszów University of Technology, Poland) René Wamkeue (University of Québec, Canada) Janusz Zalewski (Florida Gulf Coast University, USA) Marek Zaremba (University of Québec, Canada) Teresa Zielińska (Warsaw University of Technology, Poland)
Publisher: Industrial Research Institute for Automation and Measurements PIAP
If in doubt about the proper editing of contributions, please contact the Executive Editor. Articles are reviewed, excluding advertisements and descriptions of products. The Editor does not take responsibility for the contents of advertisements, inserts, etc. The Editor reserves the right to make relevant revisions, abbreviations and adjustments to the articles.
All rights reserved ©
JOURNAL of AUTOMATION, MOBILE ROBOTICS & INTELLIGENT SYSTEMS VOLUME 7, N° 1, 2013
CONTENTS

Guest Editors: Piotr Skrzypczynski, Andrzej Kasinski, Gurvinder S. Virk

3   Editorial
5   On Augmenting the Visual SLAM with Direct Orientation Measurement Using the 5-Point Algorithm
    Adam Schmidt, Marek Kraft, Michał Fularz, Zuzanna Domagała
11  Comparative Assessment of Point Feature Detectors in the Context of Robot Navigation
    Adam Schmidt, Marek Kraft, Michał Fularz, Zuzanna Domagała
21  Simultaneous Localization and Mapping for Tracked Wheel Robots Combining Monocular and Stereo Vision
    Filipe Jesus, Rodrigo Ventura
28  Absolute Localization for Low Capability Robots in Structured Environments Using Barcode Landmarks
    Duarte Dias, Rodrigo Ventura
35  Toward Rich Geometric Map for SLAM: Online Detection of Planes in 2D LIDAR
    Cyrille Berger
42  Understanding 3D Shapes Being in Motion
    Janusz Bedkowski
47  Interactive Mapping Using Range Sensor Data under Localization Uncertainty
    Pedro Vieira, Rodrigo Ventura
54  Assisted Teleoperation of Quadcopters Using Obstacle Avoidance
    João Mendes, Rodrigo Ventura
Editorial
This special section of JAMRIS, entitled EXTENDING AUTONOMY OF MOBILE ROBOTS: LOCALIZATION AND MAPPING, contains eight papers that were presented at the International Workshop on Perception for Mobile Robots Autonomy PEMRA 2012, which was held in Poznan, Poland on September 27 and 28, 2012. The idea of publishing the PEMRA papers in JAMRIS instead of preparing workshop proceedings emerged early during the planning phases of the workshop, which was organized by the Institute of Control and Information Engineering, Poznan University of Technology under the auspices of the CLAWAR Association Limited. PEMRA was intended to provide a venue where researchers can gather to present and discuss the latest achievements, future challenges, and exciting applications of perception systems for mobile robots in general, and walking robots in particular. The call for papers encouraged contributions dealing with the localization and navigation fundamentals of mobile robotics in unstructured and dynamic environments; these covered topics in sensing, mapping, self-localization, computational architectures, active perception, multi-sensor and multi-robot systems. Nowadays robotics research is more and more internationalized, and publication of the papers in an international journal improves their reception in the world-wide community of roboticists and fosters the rapid exchange of results and ideas. However, it should be noted that the papers contained in this special section were subjected to the regular JAMRIS review procedure, and are published here in improved, often significantly extended versions. We would like to express our gratitude to all of the reviewers for their time and effort in evaluating the final papers. While preparing the final papers for JAMRIS, the authors also had the opportunity to take into account many questions, comments, and proposed amendments that emerged during the extensive discussions taking place at the PEMRA workshop after each presentation.

Fifteen papers were presented at the workshop, and this special section contains eight of them which focus on key issues in robotic perception, namely, how to model the surrounding environment, the sensors and sensing systems needed, and how to localize the robots in their operational environments. The key areas of building a map of the environment, self-localizing in the resulting map, and the combined problem of simultaneous localization and mapping (SLAM) have received significant attention in the research literature. Nowadays, we should ask whether these problems are already solved and whether we should move to addressing other issues in mobile robotics. From one side the answer could be "yes", as substantial progress was made over the past 10 years in both mapping more and more complicated environments, and in developing various self-localization and SLAM algorithms, which can deal with larger scenes and more complex tasks. However, from another side the answer could be "no", as we still lack sufficiently robust, reliable self-localization or SLAM implementations in practical real-world mobile robotic system applications, and we still need more precise and faster mapping methods. These problems manifest themselves in more complicated robotic system applications, such as walking or aerial robots, but also in simple, low-cost robots, if we cannot deploy enough computing power and/or precise sensing solutions to allow standard methods to work. The papers presented in this special section address some of those issues.
Schmidt, Kraft, Fularz, and Domagala begin by describing an approach to augmenting a monocular visual SLAM system with a direct, visual measurement of the robot orientation change using the 5-point algorithm. They show that this modification reduces the tracking error of this simple SLAM setup by over 30%. The visual SLAM theme is continued in the second paper by the same team of authors. This time they present an evaluation of various contemporary interest point detector and descriptor pairs in the context of mobile robot navigation, using publicly available datasets. Jesus and Ventura also address the issues of visual SLAM in their paper, which describes an on-line 6D SLAM method for a tracked robot in an unknown and unstructured environment. The environment is mapped using natural landmarks in 3D space, autonomously collected using visual data from feature detectors. A dimensional-bounded EKF (DBEKF) is
introduced in the paper, which keeps the dimension of the state bounded. All the experimental work was done using real data from a tracked wheel robot developed for search and rescue missions. The next paper by Dias and Ventura shifts the reader's attention to the problems of low-cost mobile robots. Self-localization for low capability robots is challenging due to the inherent processing power restrictions and sensory system performance limitations. This paper proposes a complete onboard absolute self-localization system for such robots, using the EKF to fuse odometry readings with absolute pose estimates obtained from barcode observations using a simple camera. Experimental results using an e-puck robot with only 8 KB of RAM and a 16 MIPS processor are presented. The work by Berger next shows how to exploit a simple 2D sensor system in order to obtain a compact representation of large 3D environments in SLAM. In this paper a method is presented that relies on the idea that a plane appears as a line segment in a 2D scan, and that by tracking those lines frame after frame, it is possible to estimate the parameters of that plane. The next paper by Bedkowski considers a robotic system enabling the use of rich RGB-D three-dimensional images and GPGPU-based computations. The paper's goal is to develop a system with real-time capabilities to distinguish basic shapes in the robot's operational environment (corners, planes, cones, spheres etc.) that are moving in front of the RGB-D sensor. The proposed approach can be a useful starting point for further developments of methods to recognize real-world objects in dynamic environments. Next, Vieira and Ventura propose in their paper to involve the user of a robotic system in the map building process. They propose a semi-automatic approach, denoted interactive mapping, involving a human operator in the process of detecting and correcting erroneous matches between 2D scans. Instead of allowing the operator complete freedom in correcting the matching on a frame by frame basis, the proposed method facilitates the adjustment along the directions with more ambiguity, while constraining the others. The final paper by Mendes and Ventura concerns the teleoperation of an unmanned aerial vehicle in unknown, GPS-denied environments. The paper is focused on building a rough map of the nearby environment using sonar data and the FastSLAM algorithm. This allows tracking of the vehicle pose with respect to the map, and using the map to override operator commands that may lead to collisions.

We hope that this selection of papers shows PEMRA as an interesting and high quality scientific event, open to various topics related to perception in autonomous mobile robots. These papers also reveal that mapping, self-localization and SLAM are still very dynamic areas of robotics research; however, the focus is gradually shifting from fundamental algorithmic developments to tailoring specific solutions for particular robots, sensors, and applications.
Piotr Skrzypczynski, Poznan University of Technology, Poland Andrzej Kasinski, Poznan University of Technology, Poland Gurvinder S. Virk, University of Gävle, Sweden
On Augmenting the Visual SLAM with Direct Orientation Measurement Using the 5-Point Algorithm
Received 10th October 2012; accepted 22nd November 2012.
Adam Schmidt, Marek Kraft, Michał Fularz, Zuzanna Domagała

Abstract: This paper presents an attempt to merge two paradigms of visual robot navigation: Visual Simultaneous Localization and Mapping (VSLAM) and Visual Odometry (VO). The VSLAM system was augmented with a direct, visual measurement of the robot orientation change using the 5-point algorithm. An extended movement model of the robot was proposed and additional measurements were introduced to the SLAM system. The efficiency of the 5-point and 8-point algorithms was compared. The augmented system was compared with a state-of-the-art VSLAM solution, and the proposed modification allowed the tracking error to be reduced by over 30%.

Keywords: visual SLAM, visual odometry
1. Introduction
The ability to work in an unexplored environment plays a crucial role in the operation of a mobile robot. Recently, increasing attention has been paid to visual navigation systems due to the decreasing price and increasing quality of cameras, relatively simple mathematical models and the high information density of images. The main drawback of such an approach is the inability to measure the depth of an unknown scene using a single camera. The visual navigation systems can be divided into two main categories: visual simultaneous localization and mapping (VSLAM) and visual odometry (VO). The VSLAM methods focus on building a map of an unknown environment and using it to calculate the position of the robot in real time. As they are mostly used for tracking long trajectories, the feature map is relatively sparse and only a few measurements are used to update the estimate of the environment state. The first real-time VSLAM system was proposed by Davison and Murray [1]. Sim et al. [2] presented a large scale VSLAM system using the SIFT point detector and descriptor and the particle filter. Sturm and Visser [3] presented a visual compass allowing for fast estimation of the robot's orientation. At the moment, MonoSLAM [4] is considered to be one of the most successful VSLAM systems. A modification of MonoSLAM adapted for a hexapod robot was presented in [5]. The main purpose of the VO methods is to estimate the precise trajectory of the robot without a map of the environment. An in-depth survey of VO algorithms has recently been presented in [6, 7]. The VO systems use a relatively large number of measurements to estimate the transformation between consecutive robot positions. Due to the necessary processing, the VO algorithms tend to be slower, though there have been some attempts to increase their speed with FPGA implementations [8].
This paper presents a fusion of both approaches in order to improve the precision of the VSLAM system by augmenting it with a visual estimation of the robot's orientation change. The orientation change is measured using the 5-point algorithm [9] every few steps of the VSLAM system. The proposed algorithm was evaluated using sequences registered during the Rawseeds Project [10]. The augmented VSLAM system is presented in Section 2, and Section 3 presents the concluding remarks.
2. Augmented VSLAM system
2.1. Environment
The environment was modeled using the probabilistic, feature-based map adapted from the works of Davison [1, 4]. The map contains the information on the current estimates of the robot and feature positions as well as the uncertainty of those estimates. The state vector x contains the state of the robot and the features, while the covariance matrix P represents the state uncertainty modeled with a multidimensional Gaussian:

$$\mathbf{x} = \begin{bmatrix} \mathbf{x}_r \\ \mathbf{x}_f^1 \\ \mathbf{x}_f^2 \\ \vdots \end{bmatrix}, \quad \mathbf{P} = \begin{bmatrix} \mathbf{P}_{x_r x_r} & \mathbf{P}_{x_r x_f^1} & \mathbf{P}_{x_r x_f^2} & \ldots \\ \mathbf{P}_{x_f^1 x_r} & \mathbf{P}_{x_f^1 x_f^1} & \mathbf{P}_{x_f^1 x_f^2} & \ldots \\ \mathbf{P}_{x_f^2 x_r} & \mathbf{P}_{x_f^2 x_f^1} & \mathbf{P}_{x_f^2 x_f^2} & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \quad (1)$$

where x_r is the state of the robot and x_f^i is the state vector of the i-th feature. The Extended Kalman Filter is used to update the probabilistic map. The basic movement model, the 'agile camera', was proposed by Davison [1, 4]. The state vector x_r consists of the robot's Cartesian position r, the quaternion q representing the robot's orientation, the linear velocity v and the angular velocity ω:

$$\mathbf{x}_r = \begin{bmatrix} \mathbf{r} & \mathbf{q} & \mathbf{v} & \boldsymbol{\omega} \end{bmatrix}^T \quad (2)$$
The robot's model was extended with the additional quaternion q^m describing the orientation of the robot remembered during the system initialization or the last measurement of the orientation change:

$$\mathbf{x}_r^m = \begin{bmatrix} \mathbf{x}_r & \mathbf{q}^m \end{bmatrix}^T \quad (3)$$
The Inverse Depth representation of the point features was used [11], where the state of each feature consists of the camera position during the feature initialization, angles describing the bearing of the line passing through the point feature and the camera center, and the inverse of the depth:

$$\mathbf{x}_f^i = \begin{bmatrix} x_0^i & y_0^i & z_0^i & \phi^i & \theta^i & \rho^i \end{bmatrix}^T \quad (4)$$
2.2. Movement model and prediction
During the prediction stage of the EKF it is assumed that the robot is the only dynamic element of the environment. Therefore, the estimates of the point features' positions do not change during the prediction stage. At each iteration the robot is affected by random, normally distributed linear (a) and angular (α) accelerations resulting in a velocity impulse:

$$\begin{bmatrix} \mathbf{V}(k) \\ \boldsymbol{\Omega}(k) \end{bmatrix} = \begin{bmatrix} \mathbf{a}(k) \\ \boldsymbol{\alpha}(k) \end{bmatrix} \Delta T \quad (5)$$

The prediction function of the 'agile camera' model takes the following form:

$$\mathbf{x}_r(k+1|k) = f(\mathbf{x}_r(k|k), \Delta T, \mathbf{a}(k), \boldsymbol{\alpha}(k)) = \begin{bmatrix} \mathbf{r}(k|k) + (\mathbf{v}(k) + \mathbf{V}(k))\Delta T \\ \mathbf{q}(k|k) \times \mathbf{q}_\omega(k) \\ \mathbf{v}(k|k) + \mathbf{V}(k) \\ \boldsymbol{\omega}(k|k) + \boldsymbol{\Omega}(k) \end{bmatrix} \quad (6)$$

where q_ω(k) is the incremental rotation quaternion and × stands for the Grassmann product. The prediction function of the new model takes two forms depending on the orientation measurement state. If the orientation change was not measured in the last iteration, q^m remains unchanged (Eq. 7). Otherwise its value is replaced with the current estimate of the robot orientation (Eq. 8).

$$\mathbf{x}_r^m(k+1|k) = f_1(\mathbf{x}_r^m(k), \Delta T, \mathbf{a}(k), \boldsymbol{\alpha}(k)) = \begin{bmatrix} f(\mathbf{x}_r(k|k), \Delta T, \mathbf{a}(k), \boldsymbol{\alpha}(k)) \\ \mathbf{q}^m(k) \end{bmatrix} \quad (7)$$

$$\mathbf{x}_r^m(k+1|k) = f_2(\mathbf{x}_r^m(k), \Delta T, \mathbf{a}(k), \boldsymbol{\alpha}(k)) = \begin{bmatrix} f(\mathbf{x}_r(k|k), \Delta T, \mathbf{a}(k), \boldsymbol{\alpha}(k)) \\ \mathbf{q}(k) \end{bmatrix} \quad (8)$$

Fig. 1. The exemplary point features measurements with the uncertainty ellipses.
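To make the prediction step concrete, the sketch below applies Eq. (5)-(8) with NumPy. It is a minimal illustration under the state layout described above; the helper names and the quaternion convention (w, x, y, z) are our assumptions, not taken from the authors' implementation.

```python
import numpy as np

def grassmann(q1, q2):
    # Grassmann (quaternion) product, q = [w, x, y, z]
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def rotation_quaternion(omega, dt):
    # incremental rotation quaternion q_omega for angular rate omega over dt
    angle = np.linalg.norm(omega) * dt
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = omega / np.linalg.norm(omega)
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def predict(r, q, v, w, qm, dt, a, alpha, orientation_measured):
    # one prediction step of the extended 'agile camera' model (Eq. 6-8)
    V, Omega = a * dt, alpha * dt                 # velocity impulses (Eq. 5)
    r_new = r + (v + V) * dt                      # position update
    q_new = grassmann(q, rotation_quaternion(w + Omega, dt))
    q_new /= np.linalg.norm(q_new)                # keep the quaternion normalized
    v_new, w_new = v + V, w + Omega
    qm_new = q if orientation_measured else qm    # Eq. (8) vs. Eq. (7)
    return r_new, q_new, v_new, w_new, qm_new
```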
2.3. Measurement
In the basic version of MonoSLAM the state vector is updated according to the visual observations of the point features [4] (Figure 1). In the proposed system the observation vector is extended with the visual measurement of the robot orientation change. The observation vector h takes the following form:

$$\mathbf{h} = \begin{bmatrix} \mathbf{h}_1 & \ldots & \mathbf{h}_N & \mathbf{q}^h \end{bmatrix}^T \quad (9)$$

where h_i = [u_i v_i]^T stands for the observation of the i-th point feature on the image plane and q^h is the quaternion describing the orientation change, defined as:

$$\mathbf{q}^h = (\mathbf{q}^m)^* \times \mathbf{q} \quad (10)$$

$$(\mathbf{q}^m)^* = \begin{bmatrix} q_a^m & -q_b^m & -q_c^m & -q_d^m \end{bmatrix}^T \quad (11)$$

where × is the Grassmann product and (q^m)^* is the conjugate of q^m. The measurement procedure of q^m is presented in Section 2.4.
2.4. Orientation change estimation
In computer vision, the fundamental matrix F and the essential matrix E are rank 2 matrices of size 3 × 3 that relate the corresponding point pairs across two views of the same scene. If the homogeneous image coordinates of the projection of a scene point X on the first image are given by x, and the projection of the same point on the second image is given by x', then every corresponding point pair x ↔ x' is tied by the relation given by equation (12) [12]:

$$\mathbf{x}'^T \mathbf{F} \mathbf{x} = 0 \quad (12)$$

The essential matrix E is related to the fundamental matrix F as given in (13):

$$\mathbf{E} = \mathbf{K}'^T \mathbf{F} \mathbf{K} \quad (13)$$

The matrices K and K' are the camera calibration matrices, so for the monocular case K = K'. The knowledge of the essential matrix allows for the determination of the relative pose between the camera positions at which the images were registered, i.e. the rotation matrix R and the translation vector t (up to an unknown scale). The essential matrix satisfies equation (12) only if the corresponding point pair coordinates have been normalized, i.e. the raw registered coordinates of points have been multiplied by the inverses of the respective camera matrices and lens distortions have been corrected. The components of (12) can then be written as:

$$\mathbf{E} = \begin{bmatrix} e_{11} & e_{12} & e_{13} \\ e_{21} & e_{22} & e_{23} \\ e_{31} & e_{32} & e_{33} \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad \mathbf{x}' = \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}.$$

From this, for a single point pair we get:
$$x'x\,e_{11} + x'y\,e_{12} + x'e_{13} + y'x\,e_{21} + y'y\,e_{22} + y'e_{23} + x\,e_{31} + y\,e_{32} + e_{33} = 0 \quad (14)$$

Writing the elements of E as a column vector e, in row-by-row order, allows (14) to be written as (15):

$$\begin{bmatrix} x'x & x'y & x' & y'x & y'y & y' & x & y & 1 \end{bmatrix} \mathbf{e} = 0 \quad (15)$$
For any given number of point pairs n, their corresponding equations as given in (15) can be stacked, forming a matrix A and resulting in the system given in equation (16):

$$\mathbf{A}\mathbf{e} = 0 \quad (16)$$

In practice, the solution to the system of equations given by (16) is found using singular value decomposition (SVD). The decomposition is performed on the matrix A (SVD(A) = UDV^T). The solution in the least squares sense is then the right singular vector corresponding to the smallest singular value. By using 8 point pairs for the construction of the matrix A, the solution is obtained directly from the SVD and is given by the right singular vector corresponding to the smallest singular value. The minimum number of point pairs allowing for the computation of the essential matrix is five, but in this case the additional constraints given by equations (17) and (18) must be taken into account. This is referred to as the 5-point algorithm.

$$\det(\mathbf{E}) = 0 \quad (17)$$

$$2\mathbf{E}\mathbf{E}^T\mathbf{E} - \mathrm{trace}(\mathbf{E}\mathbf{E}^T)\mathbf{E} = 0 \quad (18)$$
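For comparison with the 5-point solver described next, the linear estimate of Eq. (15)-(16) with the rank-2 constraint of Eq. (17) can be sketched in a few lines of NumPy. The function name and array layout below are our assumptions; x1 and x2 hold at least eight normalized correspondences.

```python
import numpy as np

def essential_linear(x1, x2):
    # x1, x2: Nx2 arrays of normalized image coordinates, N >= 8
    x, y = x1[:, 0], x1[:, 1]
    xp, yp = x2[:, 0], x2[:, 1]
    # one row of the matrix A per point pair, following Eq. (15)
    A = np.column_stack([xp*x, xp*y, xp, yp*x, yp*y, yp, x, y, np.ones(len(x))])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)      # right singular vector of the smallest singular value
    # enforce the rank-2 constraint (Eq. 17) by averaging the two largest singular values
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt
```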
The method used for essential matrix computation was based on the work presented in [13]. The process starts with performing the SVD for the computation of the 4-dimensional nullspace over the matrix A constructed using five point pairs. The essential matrix is a linear combination of the four singular vectors corresponding to the four singular values that are equal to zero:

$$\mathbf{e} = x\mathbf{e}_1 + y\mathbf{e}_2 + z\mathbf{e}_3 + w\mathbf{e}_4 \quad (19)$$

where e_i are the vectors spanning the nullspace, and x, y, z and w are some scalars. As the essential matrix is determined up to scale, w can be substituted with 1. Substituting e into (17) and (18) gives ten 3rd degree polynomial equations, consisting of 20 monomials. The monomials arranged according to the GrLex order constitute the monomial vector X, allowing the system of equations to be rewritten in the form given in (20):

$$\mathbf{M}\mathbf{X} = 0 \quad (20)$$

The matrix M is some 10 × 20 matrix. According to [13], the system of polynomial equations can be solved by defining a so-called Gröbner basis and using it for action matrix computation. The action matrix has a size of 10 × 10. The Gröbner basis is obtained by performing the Gauss-Jordan elimination over (20):

$$\begin{bmatrix} \mathbf{I} & \mathbf{B} \end{bmatrix} \mathbf{X} = 0 \quad (21)$$

The action matrix is constructed by choosing appropriate columns from the matrix [I B] after the elimination was performed. The solution of the system of equations is encoded in the left eigenvectors of the action matrix corresponding to the real eigenvalues.

As the point pairs are detected and matched automatically, the set of matched points contains a certain fraction of outliers (false matches). Even a single incorrect match used as input data for the 5-point algorithm results in producing incorrect output. To deal with this issue, the 5-point algorithm is applied within a robust estimation framework based on the RANSAC algorithm [14]. The 5-point algorithm has higher computational requirements than the 8-point algorithm. However, the 8-point algorithm requires more RANSAC iterations, especially when the fraction of outliers is high. Furthermore, it cannot deal with the case in which the points used for computation are coplanar (a degenerate configuration).

To solve for rotation and translation, the method proposed in [12] was used. The SVD of the essential matrix E is in the ideal case given by equation (22). The singular values s_1 and s_2 have equal values, and the smallest singular value equals 0, as E is rank-deficient and of rank 2.

$$\mathrm{SVD}(\mathbf{E}) = \mathbf{U}\mathbf{D}\mathbf{V}^T, \quad \text{where } \mathbf{D} = \begin{bmatrix} s_1 & 0 & 0 \\ 0 & s_2 & 0 \\ 0 & 0 & 0 \end{bmatrix} \quad (22)$$

In practice however, due to the image noise, quantization and numerical inaccuracies, this is not the case. There are three nonzero singular values – one with a value close to zero (s_3), and the other two with significantly greater values, which are roughly similar (s_1, s_2). To improve the accuracy of the rotation and translation estimation, rank 2 can be imposed onto the matrix E by setting the lowest singular value to zero and substituting the other two singular values with their average:

$$\mathrm{SVD}(\hat{\mathbf{E}}) = \mathbf{U}\hat{\mathbf{D}}\mathbf{V}^T, \quad \text{where } \hat{\mathbf{D}} = \begin{bmatrix} \frac{s_1+s_2}{2} & 0 & 0 \\ 0 & \frac{s_1+s_2}{2} & 0 \\ 0 & 0 & 0 \end{bmatrix} \quad (23)$$

Now, let us introduce two support matrices according to equation (24):

$$\mathbf{W} = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad \mathbf{W}^T = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad (24)$$

If the first camera location is denoted by P, with rotation and translation P = [I|0] (zero rotation expressed by the identity matrix I, zero translation), and the second camera matrix by P', four solutions for rotation and translation exist:

$$\begin{aligned}
\mathbf{P}' &= [\mathbf{U}\mathbf{W}\mathbf{V}^T \,|\, \mathbf{u}_3] \\
\mathbf{P}' &= [\mathbf{U}\mathbf{W}\mathbf{V}^T \,|\, -\mathbf{u}_3] \\
\mathbf{P}' &= [\mathbf{U}\mathbf{W}^T\mathbf{V}^T \,|\, \mathbf{u}_3] \\
\mathbf{P}' &= [\mathbf{U}\mathbf{W}^T\mathbf{V}^T \,|\, -\mathbf{u}_3]
\end{aligned} \quad (25)$$

The term u_3 stands for the 3rd column of U. Out of the four solutions given by (25) only one is physically correct – the one for which the world points lie in front of both cameras (Figure 2).
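The whole measurement described in this section (5-point estimation inside RANSAC followed by the decomposition of Eq. (25) with the cheirality check) is also available as library routines; the sketch below shows one way to obtain the relative rotation from matched pixel coordinates using OpenCV. The RANSAC parameters are assumptions, not values reported in the paper.

```python
import cv2
import numpy as np

def measure_orientation_change(pts1, pts2, K):
    # pts1, pts2: Nx2 pixel coordinates of matched features, K: camera matrix
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # recoverPose tests the four solutions of Eq. (25) and keeps the one with
    # the triangulated points in front of both cameras
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t.ravel()   # rotation and unit-norm translation (scale is unknown)
```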
3. Results and conclusions
The experiments were performed using the data gathered during the Rawseeds Project [10]. The dataset contains the visual sequences recorded by the onboard cameras of the mobile robot along with the ground truth trajectory of the robot. In the first stage of the experiments the precision of the orientation change estimation was assessed. A set of 250 image pairs with known relative orientation of the camera was used in the experiment. The SURF algorithm was used to detect and describe point features on each image. The point features on each image pair were matched using brute force matching and the matches were used to estimate the camera ego-motion using the 5-point [9] and the 8-point algorithm [12]. The quaternions representing the orientation change were compared with the quaternions obtained from the ground truth data. Figure 5 presents the results of the feature matching and orientation estimation on two images from the video sequence. The error histograms of the 5-point algorithm are presented in Figure 3 and the error histograms of the 8-point algorithm are presented in Figure 4. The experiment verified the assumed Gaussian character of the orientation measurement error. Moreover, the standard deviation of the measurements obtained with the 5-point algorithm was significantly smaller, thus only this algorithm was used to augment the visual SLAM system.

Fig. 2. Four possible camera configurations, the upper-left is physically correct.

Fig. 3. The error histograms of the orientation change measurements using the 5-point algorithm.

Fig. 4. The error histograms of the orientation change measurements using the 8-point algorithm.

Fig. 5. Results of the feature matching and orientation estimation: inliers (top) and outliers (bottom).

In the second stage of the experiments the performance of the SLAM system was evaluated. As no direct stereovision algorithms were used, the scale of the estimated trajectory differs from the real data and was rescaled using the Procrustes analysis in order to estimate the system's precision. A sequence consisting of 400 images was used and the performance of the original 'agile camera' model and the proposed model was compared. In order to minimize the correlation between the point feature observations and the orientation change, visual sequences from two onboard cameras were used: the first only for the point feature observations and the second only for the orientation change estimation. The orientation change was measured every 4 iterations of the SLAM system. Figure 6 presents the estimate of the robot's trajectory obtained with the classic 'agile camera' model. The average estimation error equaled 0.34 m and the maximal error equaled 0.93 m. Figure 7 presents the trajectory estimated using the proposed model. The average error was 0.22 m and the maximal error was 0.57 m.

Fig. 6. The trajectory obtained using the agile camera model: GT trajectory - green, estimated trajectory - blue, estimated trajectory scaled using the Procrustes analysis - red.

Fig. 7. The trajectory obtained using the measurable camera model: GT trajectory - green, estimated trajectory - blue, estimated trajectory scaled using the Procrustes analysis - red.

This paper presents an attempt to merge two paradigms of visual navigation: visual odometry and visual SLAM. The monocular SLAM system was augmented with a visual estimation of the robot orientation change. The experiments showed that the proposed modifications reduced the average tracking error by 35% and the maximal tracking error by 38%. In the future it is planned to test the performance of other orientation change estimation techniques.
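The scale alignment mentioned above can be written compactly; the following NumPy sketch performs a similarity (Procrustes) fit of the estimated trajectory to the ground truth before computing the error statistics. The paper does not detail the exact variant used, so estimating a single global scale and rotation is an assumption.

```python
import numpy as np

def procrustes_align(est, gt):
    # est, gt: Nx3 arrays of corresponding trajectory positions
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ E)             # cross-covariance of the two point sets
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))      # avoid selecting a reflection
    R = U @ D @ Vt                                # optimal rotation
    s = np.trace(np.diag(S) @ D) / np.sum(E**2)   # optimal scale
    aligned = s * (R @ E.T).T + mu_g
    errors = np.linalg.norm(aligned - gt, axis=1)
    return aligned, errors.mean(), errors.max()
```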
AUTHORS
Adam Schmidt∗ – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Adam.Schmidt@put.poznan.pl.
Marek Kraft – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Marek.Kraft@put.poznan.pl.
Michał Fularz – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Michal.Fularz@put.poznan.pl.
Zuzanna Domagała – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Zuzanna.Domagala@cie.put.poznan.pl.
∗Corresponding author
Acknowledgements Adam Schmidt, Marek Kraft and Michał Fularz are scholarship holders within the project ”Scholarship support for PH.D. students specializing in majors strategic for Wielkopolska’s development“, Sub-measure 8.2.2 Human Capital Operational Programme, co-financed by European Union under the European Social Fund.
References
[1] A.J. Davison, D.W. Murray, “Simultaneous Localization and Map-Building Using Active Vision”, IEEE Trans. PAMI, vol. 24(7), 2002, pp. 865–880.
[2] R. Sim, P. Elinas, M. Griffin, “Vision-Based SLAM Using the Rao-Blackwellised Particle Filter”. In: Proc. IJCAI Workshop Reasoning with Uncertainty in Robotics, 2005. [3] J. Sturm, A. Visser, “An appearance-based visual compass for mobile robots”, Robotics and Autonomous Systems, vol. 57(5), 2009, pp. 536–545. [4] A.J. Davison, I. Reid, N. Molton and O. Stasse, “MonoSLAM: Real-Time Single Camera SLAM”, IEEE Trans. PAMI, vol. 29(6), 2007, pp. 1052–1067. [5] A. Schmidt, A. Kasiński, “The Visual SLAM System for a Hexapod Robot”, Lecture Notes in Computer Science, vol. 6375, 2010, pp. 260–267. [6] D. Scaramuzza, F. Fraundorfer, “Visual Odometry: Part I - The First 30 Years and Fundamentals”, IEEE Robotics and Automation Magazine, vol. 18(4), 2011, pp. 80–92. [7] F. Fraundorfer, D. Scaramuzza, “Visual Odometry: Part II - Matching, Robustness and Applications”, IEEE Robotics and Automation Magazine, vol. 19(2), 2012, pp. 78–90. [8] M. Fularz, M. Kraft, A. Schmidt, A. Kasiński, “FPGA Implementation of the Robust Essential Matrix Estimation with RANSAC and the 8-Point and the 5Point Method”, Lecture Notes in Computer Science, vol. 7174, 2012, pp. 60-71. [9] H. Li, R. Hartley, “Five-Point Motion Estimation Made Easy”, in Proc. of the 18th International Conference on Pattern Recognition, vol. 1, 2006, pp. 630– 633. [10] S. Ceriani, G. Fontana, A. Giusti, D. Marzorati, M. Matteucci, D. Migliore, D. Rizzi, D.G. Sorrenti, P. Taddei, “RAWSEEDS ground truth collection systems for indoor self-localization and mapping”, Autonomous Robots Journal, vol. 27(4), 2009, pp. 353– 371. [11] J. Civera, A.J. Davison, J.M.M. Montiel, “Inverse Depth Parametrization for Monocular SLAM”, IEEE Transactions on Robotics, vol. 24(5), 2008, pp. 932– 945. [12] R.I. Hartley, A. Zisserman, “Multiple View Geometry in Computer Vision”, 2nd edition, 2004, Cambridge University Press. [13] H. Stewénius, C. Engels, D. Nistér, “Recent developments on direct relative orientation”, ISPRS J. of Photogrammetry and Remote Sensing, vol. 60(4), 2006, pp. 284–294. [14] M. Fischler and R. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, Comm. of the ACM, vol. 24(6), 1981, pp. 381–395.
Comparative Assessment of Point Feature Detectors and Descriptors in the Context of Robot Navigation
Received 10th October 2012; accepted 22nd November 2012.
Adam Schmidt, Marek Kraft, Michał Fularz, Zuzanna Domagała

Abstract: This paper presents an evaluation of various contemporary interest point detector and descriptor pairs in the context of robot navigation. The robustness of the detectors and descriptors is assessed using publicly available datasets: the first gathered with a camera mounted on an industrial robot [17] and the second gathered from a mobile robot [20]. The most efficient detectors and descriptors for visual robot navigation are selected.
Keywords: point features, detectors, descriptors

1. Introduction
The detection, description and matching of point features plays a vital role in most of the contemporary algorithms for visual odometry [1, 2] or visual simultaneous localization and mapping [3, 4]. Over the last decade several new fast detectors, such as FAST (Features from Accelerated Segment Test) [5], SURF (Speeded Up Robust Features) [6] and CenSurE (Center Surround Extrema), being the basis for STAR [7], and descriptors, e.g. SURF [6], BRIEF (Binary Robust Independent Elementary Features) [8], ORB (Oriented FAST and Rotated BRIEF) [9] and FREAK (Fast Retina Keypoint) [10], have been proposed and successfully applied to robot navigation tasks. As the processing speed is a key aspect in such tasks, some of the detectors and descriptors have been either implemented in FPGA (Field Programmable Gate Array) [11, 12] or simplified [13]. At the moment, to the extent of the authors' knowledge, there is no comparative study of the newest point detectors and descriptors with regard to their applicability in robot navigation. In [14] and [15] the authors describe the desired characteristics of point detectors and descriptors, however they do not present any experimental results. The authors in [17] have compared various interest point detectors using sequences recorded with a camera placed on an industrial robot. In the follow-up research they compared the efficiency of detector-descriptor pairs, however only cross-correlation, SIFT (Scale Invariant Feature Transform) [19] and DAISY descriptors were considered. Another experimental study has been presented in [16], where the detector-descriptor pairs have been graded according to the number of feature matches supporting the 8-point algorithm solution found by the RANSAC algorithm. This indirect evaluation method was forced by difficulties in gathering the ground truth correspondence data for image pairs. This paper presents an evaluation of detector-descriptor pairs in the context of robot navigation. The measure of a pair's efficiency is based on the reprojection error of point feature pairs matched on two images. The images used were selected from the publicly available datasets [17, 20]. The analysis allows selecting the detector and descriptor pair most suitable for application in robot navigation in terms of accuracy. The rest of the paper is structured as follows: Section 2 provides a description of the detectors and descriptors evaluated in the study. The evaluation procedure is presented in Section 3. Section 4 contains the results, while Section 5 contains concluding remarks and the planned future work.

2. Detectors and descriptors
2.1. The Shi-Tomasi feature detector
The Shi-Tomasi (GFTT) feature detector relies on investigating the local auto-correlation function of the image intensity function [21]. To perform this, the so-called structural tensor is used (Eq. 1):
$$\mathbf{A} = \sum_{u}\sum_{v} w(p,q) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \quad (1)$$
The I_x and I_y terms denote the partial image derivatives in x and y directions, respectively. The term w(p, q) denotes the convolution with a weighting window over the area (p, q). A Gaussian is the most common choice for the weighting window, as it makes the responses isotropic. The feature strength measure M_c is then defined as given in equation (2):

$$M_c = \min(|\lambda_1|, |\lambda_2|) \quad (2)$$
Corners are the local maxima of M_c above some arbitrary threshold t.

2.2. FAST feature detector
The FAST [5] feature detector inspects the values of the intensity function of pixels on a circle of radius r around the candidate point p. A pixel on the circle is considered 'bright' if its intensity value is brighter by at least t, and 'dark' if its intensity value is darker by at least t than the intensity value of p, where t is some arbitrary threshold. The candidate pixel is classified as a feature on the basis of a segment test – if a contiguous, at least n pixels long arc of 'bright' or 'dark' pixels is found on the circle. The original solution uses r = 3 and n = 9. An illustration of the segment test is given in Fig. 1. The ID3 algorithm is used to optimize the order in which pixels are tested, resulting in high computational efficiency. The segment test alone produces small sets of adjacent positive responses. To further refine the results, an additional cornerness measure is used for non-maximum suppression (NMS). As the NMS is applied to only a fraction of image points that successfully passed the segment test, the processing time is kept short.
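Both detectors described so far have OpenCV implementations; the short sketch below shows how they are typically invoked (goodFeaturesToTrack corresponds to the Shi-Tomasi measure of Eq. (1)-(2)). The parameter values are illustrative assumptions, not the settings used in the experiments.

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi (GFTT): corners are maxima of min(|lambda1|, |lambda2|) above a quality level
gftt_corners = cv2.goodFeaturesToTrack(img, maxCorners=500,
                                       qualityLevel=0.01, minDistance=5)

# FAST: segment test on a circle of 16 pixels (r = 3) with non-maximum suppression
fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
fast_keypoints = fast.detect(img, None)
```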
Fig. 1. The illustration of the segment test used by the FAST feature detector.
Fig. 2. Masks used by the CenSurE (left) and STAR (right) feature detectors.

2.3. SURF feature detector
SURF [6] is an image feature detector and descriptor, inspired by the SIFT detector/descriptor. The main motivation for the development of SURF has been to overcome SIFT's main weakness – its computational complexity and low execution speed. SURF has been reported to be up to a few times faster than SIFT without compromising the performance. The detection step in SURF takes advantage of the Haar wavelet approximation of the blob detector based on the Hessian determinant. The approximations of Haar wavelets can be efficiently computed using integral images, regardless of the scale. Accurate localization of multiscale SURF features requires interpolation.

2.4. STAR feature detector
The STAR keypoint detector has been developed as a part of the OpenCV computer vision library. It is a derivative of the CenSurE (Center Surround Extrema) feature detector [7]. The authors of the solution aimed at the creation of a multiscale detector with full spatial resolution. As described in [7], the subsampling performed by SIFT and SURF affects the accuracy of feature localization. The detector uses a bi-level approximation of the Laplacian of Gaussians (LoG) filter. The circular shape of the mask is replaced by an approximation that preserves rotational invariance and enables the use of integral images for efficient computation. Scale-space is constructed without interpolation, by applying masks of different sizes. The shapes of the masks used for feature detection by CenSurE and STAR are given in Fig. 2.

2.5. SURF feature descriptor
The SURF [6] feature descriptor uses Haar wavelets in conjunction with integral images to encode the distribution of pixel intensity values in the neighborhood of the detected feature while accounting for the feature's scale. Computation of the descriptor for a given feature at the scale s begins with the assignment of the dominant orientation to make the descriptor rotation invariant.
The process starts with computing the Haar wavelet responses in two dominant directions for every point within the radius of 6s from the feature. The size of the square wavelet masks is also adjusted according to the feature scale and set to 4s. The responses are then weighted with a Gaussian centered at the feature point. Each one of the responses gives rise to a point in the vector space, with the x-responses along the abscissa and the y-responses along the ordinate. Afterwards, a circle segment covering an angle of π/3 is rotated around the origin (feature point). The responses under the segment are summed and form a resultant vector. The rotation angle corresponding to the longest resultant vector is selected as the dominant orientation of the feature descriptor. The computation of the descriptor itself starts with placing a square window with a side length of 20s centered on the feature point and oriented as computed in the previous step. The window is divided into 4 × 4 regular square subregions. Each subregion is divided into 5 × 5 uniformly distributed sample points. For each sample point, the Haar wavelet responses for the two principal directions are computed as shown in Fig. 3. Each subregion contributes to the descriptor with four components: the sums of the responses in the two principal directions (d_x, d_y) and of their absolute values, as given in equation (3):

$$DESC_{sub} = \left[ \sum d_x, \ \sum d_y, \ \sum |d_x|, \ \sum |d_y| \right] \quad (3)$$

The responses from the 16 subregions are once again weighted with a Gaussian mask according to the relative position of the subregion and the analyzed point. For 16 subregions, the descriptor size is 64.

Fig. 3. Division of the processed window into subregions and sampling points as used by the SURF feature descriptor.

2.6. BRIEF feature descriptor
The BRIEF descriptor proposed in [8] uses binary strings for feature description and subsequent matching. This enables the use of the Hamming distance to compute the descriptors' similarity. Such a similarity measure can be computed very efficiently – much faster than the commonly used L2 norm. Due to BRIEF's sensitivity to noise, the image is smoothed with a simple averaging filter before applying the actual descriptor. The value of each bit contributing to the descriptor depends on the result of a comparison between the intensity values of two points inside an image segment centered on the currently described feature.
Fig. 4. An example random sampling pattern used by the BRIEF feature descriptor; point pairs used for comparison are connected with lines.

The bit corresponding to a given point pair is set to 1 if the intensity value of the first point of this pair is higher than the intensity value of the second point, and reset otherwise. The authors of the original solution tested a few sampling strategies for selecting the point pairs. The empirical results have shown that sampling according to a Gaussian distribution centered on the described feature point results in the best performance. The proposed descriptor is 512 bits long and computed over a 48 × 48 pixel image patch, although the length of the descriptor and the size of the window can be changed to adapt to the application at hand. The initial smoothing is performed with a 9 × 9 pixel rectangular averaging filter. The basic form of BRIEF is not invariant w.r.t. rotation. An example random sampling pattern used by BRIEF is given in Fig. 4.

2.7. ORB feature descriptor
The ORB [9] (Oriented FAST and Rotated BRIEF) descriptor extends the BRIEF descriptor by adding two important improvements. The first one is to augment the descriptor with orientation data from the FAST feature detector. This makes the descriptor robust to in-plane rotation. This is done by rotating the coordinates of the point pairs for binary tests around the described feature by the feature orientation angle. The second innovation is the selection scheme for point pairs whose comparisons contribute to the descriptor. The random sampling has been replaced with a sampling scheme that uses machine learning for de-correlating BRIEF features under rotational invariance. This makes the nearest neighbor search during matching less error-prone.

2.8. FREAK feature descriptor
The FREAK [10] (Fast Retina Keypoint) descriptor is another extension of the basic concepts of BRIEF [8]. It provides the descriptor with feature orientation by summing the estimated local gradients over selected point pairs. Using a specific point sampling pattern allows a coarser discretization of rotation to be applied, allowing for savings in memory consumption. A special, biologically inspired sampling pattern is also used. While the resulting descriptor is still a binary string, the sampling pattern allows for the use of a 'coarse-to-fine' approach to feature description. Point pairs carrying the information on the most distinctive characteristics of the feature neighborhood are compared in the first place. This allows for faster rejection of false matches and shortening of the computation time.
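A typical detector-descriptor pairing of the kind evaluated in the following experiments can be set up in a few lines of OpenCV; the sketch below pairs FAST with BRIEF and matches the binary descriptors with the Hamming distance. BRIEF lives in the opencv-contrib xfeatures2d module and the threshold value is an assumption; ORB could be substituted directly if that module is unavailable.

```python
import cv2

def detect_and_match(img1, img2):
    detector = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
    descriptor = cv2.xfeatures2d.BriefDescriptorExtractor_create()
    kp1 = detector.detect(img1, None)
    kp2 = detector.detect(img2, None)
    kp1, des1 = descriptor.compute(img1, kp1)
    kp2, des2 = descriptor.compute(img2, kp2)
    # binary descriptors are compared with the Hamming distance
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    return kp1, kp2, sorted(matches, key=lambda m: m.distance)
```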
3. Experiments
3.1. Datasets
Two datasets were used in the experiments. The Robot Data Set [17, 18] consists of 60 scenes, registered from 119 positions under varying lighting conditions using a high resolution camera mounted on an industrial robot. The camera positions form 4 trajectories: 3 angular with constant distance from the scene and one linear with constant camera heading (Fig. 5). Such a diverse dataset allows the robustness of detector-descriptor pairs to be evaluated with regard to scale, rotation and illumination changes. Exemplary images from the dataset are presented in Fig. 6. The second dataset used in the experiment is a video sequence gathered with the Kinect sensor mounted on a wheeled robot [20]. The 'Pioneer SLAM' sequence consisting of 2921 images was used. The robot's trajectory during the sequence registration is presented in Fig. 7. Exemplary images from the second dataset are presented in Fig. 8. The above datasets have been selected for the experiment as they reflect the two scenarios encountered in the context of mobile robot navigation. The first represents the situation in which the mobile robot observes the same scene from a number of positions differing in distance and viewing angle. It is important to notice that the camera remains at a constant altitude, which would also be the case for most mobile robots. The ability to match feature points and consequently recognize a previously seen scene is crucial, especially in the context of loop-closing. The second dataset presents the challenges usually present in the tasks used in the localization of the mobile robot. The sequence consists of images gathered from the onboard camera, where the consecutive images differ only slightly and usually no in-plane rotation is present. The ability to detect and match features in such sequences is especially important in tasks such as visual odometry or visual SLAM. Both datasets contain images, ground truth data of the camera position, camera intrinsic parameters and distortion coefficients.

Fig. 5. The four trajectories of the Robot Data Set and the position of the reference frame.

Fig. 6. Exemplary pair of images from the Robot Data Set with matches using the FAST detector and BRIEF descriptor.

Fig. 7. The 'Pioneer SLAM' trajectory.

Fig. 8. Exemplary images from the Freiburg Data Set with matches using the FAST detector and BRIEF descriptor.

3.2. Evaluation
The following procedure was performed for every analyzed pair of images:
1) the point features were detected on both images using the selected detector;
2) the point feature descriptors were calculated using the selected descriptor algorithm;
3) the coordinates of the features were undistorted and normalized according to the camera parameters;
4) the essential matrix E describing the images' epipolar geometry was calculated using the relative ground truth translation t = [t_x t_y t_z]^T and rotation R between the two camera positions:

$$\mathbf{E} = \mathbf{R} \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix} \quad (4)$$

5) the features from both images were matched by minimizing the distance between their descriptors, resulting in a set of quadruples (u_i, v_i, U_j, V_j), where (u_i, v_i) are the normalized coordinates of the feature extracted from the first image and (U_j, V_j) are the normalized coordinates of the best matching feature from the second image;
6) for each quadruple the symmetric reprojection error was calculated according to:

$$err = \mathrm{MAX}\left( \left| \mathbf{e}_i, (U_j, V_j) \right|, \ \left| \mathbf{e}_j, (u_i, v_i) \right| \right) \quad (5)$$

where e_i and e_j are the epipolar lines defined as:

$$a_i x + b_i y + c_i = 0 \quad (6)$$
$$A_j x + B_j y + C_j = 0 \quad (7)$$
$$\begin{bmatrix} a_i & b_i & c_i \end{bmatrix}^T = \mathbf{E} \begin{bmatrix} u_i & v_i & 1 \end{bmatrix}^T \quad (8)$$
$$\begin{bmatrix} A_j & B_j & C_j \end{bmatrix} = \begin{bmatrix} U_j & V_j & 1 \end{bmatrix} \mathbf{E} \quad (9)$$

If the error was smaller than the threshold thresh, the match was considered an inlier;
7) the ratio of the number of inliers to the number of matches was calculated.
The final score of the detector-descriptor pair over a dataset was calculated as the mean of the inliers-to-matches ratios of all the image pairs in the dataset.
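Steps 4)-7) of this procedure translate directly into code; the NumPy sketch below builds the ground-truth essential matrix of Eq. (4), evaluates a symmetric point-to-epipolar-line distance in the spirit of Eq. (5)-(9) and returns the inlier ratio. The distance normalization and the default threshold are our assumptions.

```python
import numpy as np

def inlier_ratio(R, t, p1, p2, thresh=0.01):
    # R, t: ground-truth rotation and translation; p1, p2: Nx2 normalized coordinates
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])
    E = R @ tx                                    # Eq. (4)
    x1 = np.column_stack([p1, np.ones(len(p1))])  # homogeneous coordinates
    x2 = np.column_stack([p2, np.ones(len(p2))])
    l2 = (E @ x1.T).T                             # epipolar lines in the second image (Eq. 8)
    l1 = (E.T @ x2.T).T                           # epipolar lines in the first image (Eq. 9)
    d2 = np.abs(np.sum(l2 * x2, axis=1)) / np.hypot(l2[:, 0], l2[:, 1])
    d1 = np.abs(np.sum(l1 * x1, axis=1)) / np.hypot(l1[:, 0], l1[:, 1])
    err = np.maximum(d1, d2)                      # symmetric error (Eq. 5)
    return np.mean(err < thresh)
```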
4. Results
Table 1 presents the average number of features found by the evaluated detectors on the images from both datasets. It is clearly visible that the STAR detector has been able to find the smallest number of features – over six times fewer than the second-worst detector. Tables 2–5 contain the maximal and minimal matching ratios of the tested detector-descriptor pairs over the sequences from the Robot Data Set. As expected, the maximal matching ratios have been observed for the smallest angular or linear displacements. Increasing the displacement has caused the reduction of the matching ratio down to the minimal values, observed for the biggest displacements (cf. Fig. 9–14). Analogously, Table 6 presents the maximal and minimal matching ratios of the detector-descriptor pairs over the Freiburg dataset. The smallest ratios have been observed for the biggest difference in frames. Surprisingly, the best matching has been observed while matching every second frame. This shows that the computation of the visual odometry parameters under too small a displacement of features between two consecutive images is inaccurate (Fig. 15). The experiments show that the binary vector based BRIEF and ORB descriptors have clearly outperformed the other tested solutions. Interestingly, the third binary vector descriptor – FREAK – displayed a significantly lower matching accuracy than the other tested solutions, contrary to the claims in [10]. The popular SURF feature descriptor performed relatively well, but since it offers lower matching accuracy than BRIEF and ORB and is slower to compute and match, there are strong arguments against using it.
Tab. 1. The average number of detected features.

Detector        Robot Image Dataset   Freiburg Dataset
FAST            1469                  634
Pyramid FAST    2292                  1074
GFTT            1469                  868
PyramidGFTT     2984                  773
SURF            4580                  1421
StarKeypoint    221                   101
SURF is also the descriptor most influenced by the type of detector it is paired with. This is visible especially in the results from the Freiburg dataset (Fig. 15). It is worth noting that the use of the pyramid FAST and GFTT detectors does not increase the matching performance when using a scale-aware descriptor such as SURF or ORB. This is rather surprising, and is probably caused by the necessity of performing interpolation to determine the location of the feature, which negatively impacts the feature location accuracy in higher scales. The Star keypoint detector is not burdened by this additional inaccuracy, as it offers full location accuracy across all scales. The results for the linear sequence shown in Fig. 14 support this claim. It must be noted, however, that the Star detector requires a relatively feature-rich environment, as the average feature count it returns is relatively low, as shown in Table 1.
5. Conclusions
Various contemporary point feature detector and descriptor pairs were compared in order to determine the best combination for the task of visual robot navigation. The sequences chosen as a testbed displayed the typical point feature distortions encountered during indoor mobile robot navigation – scaling and affine transformation with very little or no in-plane rotation. As all descriptors perform well when paired with the FAST corner detector, the FAST-BRIEF pair is a good choice when processing speed is a concern. Under the camera movement conditions featured in both of the test sequences used, the additional computational cost to bear when using descriptors and detectors robust to in-plane rotation and large scaling seems to be unjustified. This also confirms the need for testing the detector-descriptor pairs in the context of the application, as the requirements raised by it may differ significantly from the requirements raised by typical, commonly used benchmarking image sequences. In the future it is planned to compare the efficiency of the detector-descriptor pairs in a monocular SLAM system.
AUTHORS
Adam Schmidt∗ – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Adam.Schmidt@put.poznan.pl.
Marek Kraft – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Marek.Kraft@put.poznan.pl.
Michał Fularz – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Michal.Fularz@put.poznan.pl.
Zuzanna Domagała – Poznań University of Technology, Institute of Control and Information Engineering, ul. Piotrowo 3A, 60-965 Poznań, Poland, e-mail: Zuzanna.Domagala@cie.put.poznan.pl.
∗Corresponding author
Acknowledgements Adam Schmidt, Marek Kraft and Michał Fularz are scholarship holders within the project ”Scholarship support for PH.D. students specializing in majors strategic for Wielkopolska’s development“, Sub-measure 8.2.2 Human Capital Operational Programme, co-financed by European Union under the European Social Fund. This research was financed by the National Science Centre grant funded according to the decision DEC2011/01/N/ST7/05940, which is gratefully acknowledged.
References
[1] D. Scaramuzza, F. Fraundorfer, “Visual Odometry: Part I – The First 30 Years and Fundamentals”, IEEE Robotics and Automation Magazine, vol. 18(4), 2011, pp. 80–92. [2] F. Fraundorfer, D. Scaramuzza, “Visual Odometry: Part II – Matching, Robustness and Applications”, IEEE Robotics and Automation Magazine, vol. 19(2), 2012, pp. 78–90. [3] A. J. Davison, I. Reid, N. Molton and O. Stasse, “MonoSLAM: Real-Time Single Camera SLAM”, IEEE Trans. PAMI, vol. 29(6), 2007, pp. 1052–1067. [4] A. Schmidt, A. Kasiński, “The Visual SLAM System for a Hexapod Robot”, Lecture Notes in Computer Science, vol. 6375, 2010, pp. 260–267. [5] E. Rosten, T. Drummond, “Machine learning for high-speed corner detection”. In: Proc. of European Conf. on Computer Vision, 2006, pp. 430–443. [6] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, “SURF: Speeded Up Robust Features”, Computer Vision and Image Understanding, vol. 110(3), 2008, pp. 346–359. [7] M. Agrawal, K. Konolige, M.R. Blas, “CenSurE: Center surround extremas for realtime feature detection and matching”, Lecture Notes in Computer Science, vol. 5305, 2008, pp. 102–115. [8] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary Robust Independent Elementary Features”. In: Proceedings of ECCV, 2010, pp. 778–792. [9] E. Rublee, V. Rabaud, K. Konolige, G. R. Bradski, “ORB: An efficient alternative to SIFT or SURF”. In: Proc. ICCV, 2011, pp. 2564–2571. [10] A. Alahi, R. Ortiz, P. Vandergheynst, “FREAK: Fast Retina Keypoint”. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2012.
15
Fig. 9. Matching ratio of points detected with the FAST detector. Robot Image Dataset.
Fig. 10. Matching ratio of points detected with the Pyramid FAST detector. Robot Image Dataset.
Fig. 11. Matching ratio of points detected with the GFTT detector. Robot Image Dataset.
Fig. 12. Matching ratio of points detected with the Pyramid GFTT detector. Robot Image Dataset.
Fig. 13. Matching ratio of points detected with the SURF detector. Robot Image Dataset.
Fig. 14. Matching ratio of points detected with the StarKeypoint detector. Robot Image Dataset.
Tab. 2. Maximal and minimal matching ratio over the arc1 sequence of the Robot Data Set.

              SURF        BRIEF       ORB         FREAK
              MAX   MIN   MAX   MIN   MAX   MIN   MAX   MIN
FAST          0.71  0.31  0.95  0.47  0.93  0.45  0.49  0.23
PyramidFAST   0.76  0.32  0.95  0.46  0.94  0.45  0.49  0.24
GFTT          0.87  0.35  0.93  0.41  0.87  0.32  0.31  0.22
PyramidGFTT   0.46  0.24  0.97  0.47  0.95  0.44  0.52  0.23
SURF          0.51  0.24  0.96  0.46  0.94  0.43  0.51  0.23
STAR          0.77  0.35  0.91  0.48  0.90  0.46  0.47  0.27

Tab. 3. Maximal and minimal matching ratio over the arc2 sequence of the Robot Data Set.

              SURF        BRIEF       ORB         FREAK
              MAX   MIN   MAX   MIN   MAX   MIN   MAX   MIN
FAST          0.39  0.18  0.62  0.22  0.55  0.21  0.18  0.13
PyramidFAST   0.47  0.19  0.68  0.24  0.62  0.23  0.19  0.13
GFTT          0.64  0.23  0.63  0.23  0.51  0.20  0.19  0.13
PyramidGFTT   0.23  0.14  0.68  0.24  0.59  0.23  0.21  0.14
SURF          0.27  0.15  0.70  0.24  0.63  0.23  0.21  0.14
STAR          0.47  0.21  0.63  0.24  0.57  0.22  0.19  0.14

Tab. 4. Maximal and minimal matching ratio over the arc3 sequence of the Robot Data Set.

              SURF        BRIEF       ORB         FREAK
              MAX   MIN   MAX   MIN   MAX   MIN   MAX   MIN
FAST          0.28  0.11  0.40  0.11  0.31  0.12  0.15  0.09
PyramidFAST   0.38  0.11  0.43  0.12  0.44  0.12  0.16  0.10
GFTT          0.56  0.11  0.39  0.12  0.32  0.12  0.14  0.11
PyramidGFTT   0.19  0.11  0.42  0.12  0.33  0.12  0.15  0.09
SURF          0.23  0.11  0.43  0.12  0.44  0.12  0.15  0.10
STAR          0.42  0.13  0.43  0.13  0.35  0.13  0.15  0.11

Tab. 5. Maximal and minimal matching ratio over the linear sequence of the Robot Data Set.

              SURF        BRIEF       ORB         FREAK
              MAX   MIN   MAX   MIN   MAX   MIN   MAX   MIN
FAST          0.70  0.29  0.95  0.40  0.93  0.32  0.40  0.15
PyramidFAST   0.76  0.38  0.96  0.44  0.94  0.44  0.42  0.15
GFTT          0.89  0.56  0.94  0.39  0.89  0.32  0.24  0.15
PyramidGFTT   0.41  0.19  0.97  0.42  0.95  0.33  0.43  0.15
SURF          0.48  0.23  0.97  0.43  0.95  0.44  0.45  0.15
STAR          0.76  0.42  0.92  0.43  0.91  0.35  0.35  0.15

Tab. 6. Maximal and minimal matching ratio over the Freiburg dataset.

              SURF        BRIEF       ORB         FREAK
              MAX   MIN   MAX   MIN   MAX   MIN   MAX   MIN
FAST          0.42  0.24  0.76  0.39  0.73  0.38  0.19  0.14
PyramidFAST   0.53  0.28  0.78  0.40  0.76  0.39  0.20  0.15
GFTT          0.66  0.35  0.74  0.39  0.68  0.35  0.16  0.14
PyramidGFTT   0.23  0.15  0.77  0.39  0.73  0.38  0.19  0.14
SURF          0.32  0.19  0.77  0.40  0.75  0.39  0.18  0.14
STAR          0.56  0.31  0.75  0.41  0.74  0.40  0.25  0.19
Fig. 15. Matching ratio of points detected with the analyzed detectors. Freiburg Dataset.
Simultaneous Localization and Mapping for Tracked Wheel Robots Combining Monocular and Stereo Vision
Received 10th October 2012; accepted 22nd November 2012.
Filipe Jesus, Rodrigo Ventura

Abstract: This paper addresses an online 6D SLAM method for a tracked wheel robot in an unknown and unstructured environment. While the robot pose is represented by its position and orientation over a 3D space, the environment is mapped with natural landmarks in the same space, autonomously collected using visual data from feature detectors. The observation model opportunistically employs features detected from either monocular or stereo vision. These features are represented using an inverse depth parametrization. The motion model uses odometry readings from motor encoders and orientation changes measured with an IMU. A dimensional-bounded EKF (DBEKF) is introduced here, which keeps the dimension of the state bounded. A new landmark classifier using a Temporal Difference Learning methodology is used to identify undesired landmarks in the state. By forcing an upper bound on the number of landmarks in the EKF state, the computational complexity is reduced to a constant while not compromising its integrity. All experimental work was done using real data from RAPOSA-NG, a tracked wheel robot developed for Search and Rescue missions.

Keywords: simultaneous localisation and mapping, extended Kalman filter, feature detector, inverse depth parametrization, landmark evaluation, temporal difference learning
1. Introduction
SLAM is one of the most promising fields in robotics, aiming at tracking the location of a robot and mapping its surroundings using external sensor data. The EKF, when applied to SLAM, proves to work reasonably well with distinct, well-matched observations and a small state for estimation. However, insertion of new data over time without removal increases the EKF complexity, hindering its scalability over time. By memory-bounding the state, the EKF complexity is assured not to grow over time and, using proper classifiers, undesired features are automatically removed. A side effect of this removal procedure is that the map becomes visually sparse, but as long as it suffices for the SLAM needs of stable predictions, one can use proper techniques to acquire visually more compelling maps. This work was implemented on RAPOSA-NG, a tracked wheel robot for SaR missions (Figure 1). This robot has an adjustable frontal body, where both the IMU and the camera are located. Since most SaR robots perform motion within irregular terrains, this paper focuses on estimating both the robot position and attitude in a 3D Euclidean space. It uses an IMU to measure orientation changes and encoder odometry from both tracks to measure translation changes
over time. This paper also introduces an elegant way to insert landmarks in the state from both monocular and stereo visualizations using the inverse depth parametrization. All landmarks are treated the same way, regardless of their origin. While monocular observations carry no depth information, depth can be estimated through parallax changes over time. Stereo observations, on the other hand, provide depth, which not only allows for more accurate maps with fewer observations, but also solves the problem of map scale, common to most monocular SLAM techniques1. J.J. Leonard and H.F. Durrant-Whyte introduced the Simultaneous Localization and Mapping (SLAM) terminology to the robotics field and the concept of geometric beacons: natural landmarks present in the environment that can be reliably observed, as well as described in terms of a concise geometric parametrization (referred to in this paper simply as landmarks) [5]. Geometric beacons can be acquired with many different types of sensors, as long as the aforementioned qualities are maintained. Davison et al. proposed a real-time algorithm which recovers the location of a monocular camera over time using SLAM with a random walk motion model [3]. However, feature initialization requires more than one observation, so that a proper triangulation for an initial depth estimate can be done. Also, it needs to acquire landmarks with known depth for scale initialization. Thus, Civera and Davison presented an inverse depth parametrization that represents landmark uncertainty more accurately than the standard XYZ parametrization [2]. The increase in accuracy can be justified by the higher degree of linearity of the inverse depth parametrization over the XYZ parametrization. However, this representation over-parametrizes each landmark (6 components instead of the 3 of XYZ), increasing the EKF complexity even further. They also defined a landmark classifier that removes 50% of all predicted landmarks that should be visible but are not detected by any feature detector. This approach leads to the landmark classifier introduced in this paper. The usage of a random walk model assumes a well behaved motion with smooth linear and angular velocities over time, a condition that often fails for tracked wheel robots on non-planar ground (e.g., stair climbing). Pinies et al. included the usage of an IMU in vision SLAM with inverse depth parametrization [7]. In fact, with orientation changes measured by an IMU, the uncertainty of the camera location is reduced. However, it does not decrease the uncertainty when only linear motion is observed, which leads to the need for the odometry inclusion presented in this paper. To solve the map scale problem, this paper extends the inverse depth parametrization usage to stereo vision as well.
Fig. 1. RAPOSA-NG.

Fig. 2. Horizontal stereo camera representation (top and side views).
2. State and Model Definitions
We address the problem of simultaneously estimating the robot pose and landmark positions (SLAM) using a probabilistic approach based on the Extended Kalman Filter. The state of this filter encompasses both the robot pose and the landmark positions. The motion model is used in the predict step, while the observation model is used in the update step, as usual [5].

2.1. State Representation
The SLAM algorithm estimates the pose of the camera frame with respect to a world frame. The camera frame is considered to be located at the midpoint between the two stereo cameras. The EKF state is defined as

$$ s_t = \begin{pmatrix} r_t^T & q_t^T & y_1^T & \cdots & y_n^T \end{pmatrix}^T, \qquad (1) $$

where vector r_t and unit quaternion q_t represent the camera position and attitude (i.e., pose of the camera frame) in the world frame at time t. All y_i with i ∈ {1, ..., n} correspond to 3D point landmarks represented using an inverse depth parametrization,

$$ y_i = \begin{pmatrix} X_i^o & Y_i^o & Z_i^o & \theta_i & \phi_i & p_i \end{pmatrix}^T \qquad (2) $$

where (X_i^o, Y_i^o, Z_i^o)^T is an arbitrary point in XYZ, θ_i and φ_i are the azimuth and elevation of the semi-ray that crosses both this point and the landmark in the world frame, and p_i is the inverse of the distance between (X_i^o, Y_i^o, Z_i^o)^T and the landmark. This parametrization is capable of representing any landmark in space. One can compute y_i^XYZ, the landmark position in XYZ, by

$$ y_i^{XYZ} = o_i + \frac{1}{p_i} m_i \qquad (3) $$

where

$$ m_i = \begin{pmatrix} \cos\phi_i \sin\theta_i \\ -\sin\phi_i \\ \cos\phi_i \cos\theta_i \end{pmatrix}. \qquad (4) $$
Usually, o_i corresponds to the focal point of the camera when the landmark was first observed. While this parametrization has more degrees of freedom than necessary (6 instead of 3), it has interesting properties regarding linearity over the EKF [2].
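A minimal sketch of the conversion in equations (2)-(4), turning an inverse-depth landmark into an XYZ point, is given below; function and variable names are chosen here purely for illustration.

```python
import numpy as np

def inverse_depth_to_xyz(o, theta, phi, p):
    """Eqs. (2)-(4): o is the 3D anchor point (X_o, Y_o, Z_o), theta/phi the
    azimuth/elevation of the semi-ray in the world frame, and p the inverse
    of the distance between o and the landmark."""
    m = np.array([np.cos(phi) * np.sin(theta),
                  -np.sin(phi),
                  np.cos(phi) * np.cos(theta)])      # eq. (4)
    return np.asarray(o, dtype=float) + m / p         # eq. (3)

# Example: a landmark anchored at the origin, straight ahead, 5 m away.
print(inverse_depth_to_xyz([0.0, 0.0, 0.0], theta=0.0, phi=0.0, p=0.2))
```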
2.2. Observation Model
The observation model describes how each feature is perceived by the sensors. Each feature can be perceived either in stereo by both cameras, or in mono by only one of the cameras. In any case, the observation model provides the expected pixel position of each feature, for each one of the cameras. This computation is performed in two steps:
1) For each landmark y_i in the state, compute a directional vector in the camera frame that points from the camera position to the landmark position as

$$ h_i^{C_t} = R_{q_t^*} \left( p_i (o_i - r_t) + m_i \right) \qquad (5) $$

where R_{q_t^*} is the rotation matrix that rotates the world frame to the camera frame (from the conjugate of q_t, q_t^*).
2) Using the Pinhole Camera Model, for each landmark i situated in front of the camera, project the landmark position, along that directional vector, onto each one of the image planes of the cameras.
This model assumes a single camera with no lenses and no aperture radius. It does not model any type of image distortion or blur present in every camera. For this paper, the information used for observation analysis passed through a correction process using the camera's proprietary software before being used by the EKF, returning an undistorted image with known intrinsic parameters, while maintaining a wide visual range. This software also rectifies each pair of stereo images, by projecting them onto a common image plane [6]. If no software correction is available, distortion can be compensated with proper models using distortion parameters intrinsic to the camera, retrieved through calibration methods.
A horizontal stereo camera is used in this paper to acquire image data from two different sources. Since all images are properly rectified, a given pair of features from both cameras can only correspond to the same landmark if they share the same horizontal axis. This rectification also results in a pair of images with the same size and intrinsic parameters. A set of coordinate frames, (X^{CL_t}, Y^{CL_t}, Z^{CL_t}) and (X^{CR_t}, Y^{CR_t}, Z^{CR_t}), is defined for the left and right camera, respectively. After the rectification process, both right and left camera frames share the same orientation as the camera frame and are displaced by b along the X^{C_t} axis. To simplify the formalism, a parameter k^{LR} is introduced, where

$$ k^{LR} = \begin{cases} 0, & \text{from the left (L) camera} \\ 1, & \text{from the right (R) camera} \end{cases} \qquad (6) $$

Both directional vectors from the left and right cameras, h_i^{CL_t} and h_i^{CR_t}, can easily be computed from h_i^{C_t},

$$ h_i^{CL_t/CR_t} = h_i^{C_t} - (-1)^{k^{LR}} h_{0_{i_t}} \qquad (7) $$

where

$$ h_{0_{i_t}} = p_i \bar{b} \quad \text{and} \quad \bar{b} = \begin{pmatrix} b & 0 & 0 \end{pmatrix}^T. \qquad (8) $$

With the pinhole model, one can either model an observation from both cameras,

$$ z_i^{Stereo_t} = \begin{pmatrix} z_{u_{i_t}}^L \\ z_{u_{i_t}}^R \\ z_{v_{i_t}} \end{pmatrix} = \begin{pmatrix} z_{u_{i_t}} \\ z_{u_{i_t}} \\ z_{v_{i_t}} \end{pmatrix} - \frac{f_C\, m_u\, p_i}{h_{z_{i_t}}^{C}} \begin{pmatrix} -b \\ b \\ 0 \end{pmatrix}, \qquad (9) $$

or from the left or right camera only,

$$ z_i^{Left_t} = \begin{pmatrix} z_{u_{i_t}}^L \\ z_{v_{i_t}} \end{pmatrix}, \qquad z_i^{Right_t} = \begin{pmatrix} z_{u_{i_t}}^R \\ z_{v_{i_t}} \end{pmatrix}. \qquad (10) $$
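The sketch below traces the two steps of the observation model for one landmark, following equations (5) and (9): rotate the inverse-depth ray into the camera frame and project it with the pinhole model onto the left and right image planes. The parameter names (f_C, m_u, m_v, c_u, c_v, b) follow the paper's notation; the quaternion-to-rotation helper and the pixel-coordinate sign convention (taken from equation (12)) are assumptions made for illustration.

```python
import numpy as np

def quat_to_rot(q):
    # Rotation matrix of a unit quaternion q = (w, x, y, z); its transpose
    # corresponds to the conjugate quaternion rotation R_{q*} used in eq. (5).
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def observe_stereo(r, q, o, m, p, fc, mu, mv, cu, cv, b):
    """Eq. (5): directional vector in the camera frame; eq. (9): expected
    pixel coordinates in the left and right (rectified) cameras."""
    h = quat_to_rot(q).T @ (p * (np.asarray(o) - np.asarray(r)) + m)
    zu = cu - fc * mu * h[0] / h[2]   # pinhole projection, central camera
    zv = cv - fc * mv * h[1] / h[2]
    d = fc * mu * p * b / h[2]        # disparity term from eq. (9)
    return (zu + d, zv), (zu - d, zv)  # left and right pixel coordinates
```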
2.3. Motion Model
The motion model employs the odometry readings to estimate linear movement along the body frame and IMU readings to estimate incremental rotations of the robot. Three different frames are defined for the robot at each iteration t:
1) Camera frame, representing the camera pose at iteration t, as defined in subsection 2.1;
2) Body frame, representing the robot body pose at iteration t. In RAPOSA-NG, the transformation between this frame and the previous one depends solely on the angle of the frontal body of the robot;
3) IMU frame, representing the IMU pose at instant t. The angular velocity ω^imu can be modelled through the IMU gyroscopes. In RAPOSA-NG, this frame is attached to the frontal body.
In RAPOSA-NG the IMU is mounted on the frontal body, and thus the IMU frame shares the same attitude as the camera frame. From odometry the robot obtains linear movement along the body frame, by averaging the velocity of both tracks. Differential movement is discarded, since tracked wheel robots provide unreliable angular movement measurements from odometry. From the IMU, the robot obtains attitude changes. These changes are modeled as an angular velocity ω^imu, defined by

$$ \omega_t^{imu} = \omega_t^{gyro} + \omega_t^{bias} + \omega_t, \qquad (11) $$

where ω_t^gyro is the angular velocity retrieved from the IMU, ω_t^bias is the bias error normally associated with most IMUs (if the IMU uses optical or MEMS technology and is calibrated, ω_t^bias can be assumed to be zero for some period of time [10]) and ω_t is a normally distributed error with zero mean. From ω^imu one can obtain an incremental rotation q_t^imu using a zeroth-order integrator as described in [9]. The frame transformations among the frames defined in section 2.1 propagate these movement measurements, as well as their uncertainties (as covariances), to the camera frame.
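A minimal sketch of a zeroth-order integration step is given below: a bias-compensated gyroscope reading ω (rad/s) held constant over an interval Δt is turned into an incremental rotation quaternion via the standard axis-angle construction. This is an assumption about the exact form of the integrator in [9], shown for illustration only.

```python
import numpy as np

def incremental_quaternion(omega, dt):
    """Zeroth-order integrator: treat omega as constant over dt and build the
    corresponding incremental rotation quaternion (w, x, y, z)."""
    angle = np.linalg.norm(omega) * dt
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])  # no measurable rotation
    axis = np.asarray(omega) / np.linalg.norm(omega)
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

# Example: 0.5 rad/s about the z axis for 0.1 s -> small incremental yaw.
print(incremental_quaternion([0.0, 0.0, 0.5], 0.1))
```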
2.4. Feature Initialization
Over time, visual observations are made and new landmarks are attached to the state from the observed visual features. Many criteria can be used to establish when new landmarks should be inserted and how many. For instance, one can add a new landmark every time a visual feature is observed that does not match any landmark in the state. However, doing so is computationally ineffective, as it fills the state in a short time if no landmark removal procedure is performed. Assuming the usage of the stereo camera, one can acquire monocular features either from the left or from the right camera. Also, some features acquired from both cameras correspond to the same landmark, resulting in a stereo feature. Depending on whether the new landmark in the state results from a monocular feature or from a stereo feature, two different initializations are introduced:
1) From a monocular observation: If a new landmark y_{n+1} is to be appended to the state from a feature detected by only one of the cameras, first a directional vector for the respective camera frame is computed using the Pinhole Camera Model. The point (X_i^o, Y_i^o, Z_i^o)^T is set to the respective camera center, and θ_i and φ_i are set to the azimuth and elevation of the semi-ray that crosses this point and the feature location in the image plane:

$$ h_{n_t}^{CL/CR} = \begin{pmatrix} \frac{1}{f_C m_u} \left( c_u - z_{u_{n_t}}^{L/R} \right) \\ \frac{1}{f_C m_v} \left( c_v - z_{v_{n_t}} \right) \\ 1 \end{pmatrix} \qquad (12) $$

Having R_{q_t} from q_t in the state and knowing that both left and right camera frames share the same orientation as the camera frame, the directional vector can be related to the world frame by

$$ h_{n_t} = R_{q_t} h_{n_t}^{CL/CR} \qquad (13) $$

and

$$ \begin{pmatrix} o_n \\ \theta_n \\ \phi_n \\ p_n \end{pmatrix} = \begin{pmatrix} r_t + (-1)^{k^{LR}} R_{q_t} \bar{b} \\ \arctan(h_{x_{n_t}}, h_{z_{n_t}}) \\ \arctan(-h_{y_{n_t}}, |h_{xz_{n_t}}|) \\ p_0 \end{pmatrix} \qquad (14) $$

where

$$ |h_{xz_{n_t}}| = \sqrt{ (h_{x_{n_t}})^2 + (h_{z_{n_t}})^2 } \qquad (15) $$

It is impossible to gain depth information from just one observation, thus an initial arbitrary value p_0 serves
as an initial estimation for the inverse depth, given enough uncertainty. This parametrization is approximately linear along the corresponding semi-ray, allowing the EKF to sustain and correct large errors in the depth estimation.

Fig. 3. Top view of the camera frame for stereo vision of a landmark y_i.

2) From a stereo observation: Using epipolar geometry, one can compute the landmark parameters for the inverse depth parametrization, but with a measured depth rather than a default value. The point (X_i^o, Y_i^o, Z_i^o)^T is set to the camera frame origin, and θ_i and φ_i are set to the azimuth and elevation of the semi-ray that crosses the camera frame and the landmark XYZ position extracted from stereo. If the landmark comes from a stereo pair of features, a directional vector from the camera frame is first computed using the Pinhole Camera Model. Since both left and right camera frames are displaced from the camera frame by the same distance but in opposite directions, the directional vector can be calculated as

$$ h_{n_t}^{C} = \begin{pmatrix} \frac{1}{f_C m_u} \left( c_u - \frac{1}{2}(z_{u_{n_t}}^L + z_{u_{n_t}}^R) \right) \\ \frac{1}{f_C m_v} \left( c_v - z_{v_{n_t}} \right) \\ 1 \end{pmatrix}. \qquad (16) $$

In the same fashion as equation (13),

$$ h_{n_t} = R_{q_t} h_{n_t}^{C} \qquad (17) $$

and

$$ \begin{pmatrix} o_n \\ \theta_n \\ \phi_n \\ p_n \end{pmatrix} = \begin{pmatrix} r_t \\ \arctan(h_{x_{n_t}}, h_{z_{n_t}}) \\ \arctan(-h_{y_{n_t}}, |h_{xz_{n_t}}|) \\ p_e \end{pmatrix}, \qquad (18) $$

where p_e can be computed using epipolar geometry [4],

$$ p_e = \frac{ z_{u_{n_t}}^L - z_{u_{n_t}}^R }{ 2\, b\, f_C\, m_u\, |h_{n_t}^C| }. \qquad (19) $$
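A sketch of the stereo initialization of equations (16)-(19) follows: build the central directional vector from the pair of rectified pixel measurements, rotate it into the world frame, and compute the inverse depth p_e from the disparity. Variable names follow the paper's notation; R_q is the camera-to-world rotation matrix (obtained, e.g., from the state quaternion), and this is a sketch rather than the authors' implementation.

```python
import numpy as np

def init_stereo_landmark(zL_u, zR_u, z_v, r, R_q, fc, mu, mv, cu, cv, b):
    """Eqs. (16)-(19): inverse-depth landmark (o, theta, phi, p) from one
    stereo observation; R_q rotates the camera frame into the world frame."""
    h_c = np.array([(cu - 0.5 * (zL_u + zR_u)) / (fc * mu),   # eq. (16)
                    (cv - z_v) / (fc * mv),
                    1.0])
    h = R_q @ h_c                                             # eq. (17)
    o = np.asarray(r, dtype=float)                            # camera frame origin
    theta = np.arctan2(h[0], h[2])                            # azimuth
    phi = np.arctan2(-h[1], np.hypot(h[0], h[2]))             # elevation, eq. (15)
    p_e = (zL_u - zR_u) / (2.0 * b * fc * mu * np.linalg.norm(h_c))  # eq. (19)
    return o, theta, phi, p_e
```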
2.5. Feature Detector and Descriptor
For landmark detection and matching, this work uses ORB (Oriented FAST and Rotated BRIEF), a rotation-invariant (but not scale-invariant) feature detector and descriptor [8]. Although less reliable than SURF [1], it still behaves with great accuracy for small scale changes. While both methods perform well, ORB is faster but less accurate than SURF regarding scale changes. However, if those changes are considered small, ORB accuracy suffices for the SLAM needs. ORB computes FAST features with added orientation information from the intensity centroid. For the descriptors, it uses BRIEF descriptors, calculated from binary intensity tests and rotated using the assigned orientation.
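ORB is readily available in OpenCV; the sketch below shows the kind of detection and matching step assumed here, returning matched pixel coordinates such as those consumed by the EKF update. It is an illustrative example, not the actual RAPOSA-NG code.

```python
# Illustrative ORB detection/matching step (OpenCV); not the RAPOSA-NG code.
import cv2

orb = cv2.ORB_create(nfeatures=500)

def detect_and_match(prev_img, curr_img):
    """Detect oriented FAST keypoints with rotated BRIEF descriptors in two
    grayscale frames and match them with Hamming distance."""
    kp1, des1 = orb.detectAndCompute(prev_img, None)
    kp2, des2 = orb.detectAndCompute(curr_img, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches]
```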
3. Dimensional-Bounded EKF (DBEKF)
One of the major problems regarding the Extended Kalman Filter is the fact that its computational complexity increases at a quadratic order with the number of landmarks. This paper introduces a Dimensional-Bounded Extended Kalman Filter (DBEKF), which equips the EKF with criteria for landmark insertion and removal. For that, a new Landmark Classifier has to be introduced first. Figure 4 shows the proposed DBEKF architecture.

Fig. 4. DBEKF flowchart.

3.1. Landmark Classifier
For the DBEKF, a landmark y_i is said to be visible in state s_t, y_i ∈ V_{s_t}, if it is observable from state s_t. Also, y_i is detected at iteration t, y_i ∈ D_t, if the feature detector points out a corresponding feature. In a perfect scenario without any physical occlusions, V_{s_t} = D_t, that is, if the landmark is visible it should be detected. However, feature detectors are prone to error: descriptors can fail to point out some correspondences and features can fail to be detected. These inaccuracies are crucial to classify each landmark's usability in the state. Since it is assumed that no landmarks have physical occlusions, a visible but not detected landmark can only represent a failed match. In these cases, failed matches promote the corresponding landmark to be removed from the state. A Temporal Difference Learning approach is used to predict a measure of the utility, u_t^i, of each landmark at iteration t:

$$ u_t^i = \begin{cases} G\, u_{t-1}^i + (1 - G)\, \mathbb{1}_{D_t}^i & \text{if } y_i \in V_{s_t} \\ u_{t-1}^i & \text{otherwise,} \end{cases} \qquad (20) $$

where G ∈ [0, 1] is an arbitrary weight set by the user and the indicator function 1_{D_t}^i is defined for detectability,

$$ \mathbb{1}_{D_t}^i = \begin{cases} 1 & \text{if } y_i \in D_t \\ 0 & \text{otherwise.} \end{cases} \qquad (21) $$

The lower the utility of a landmark, the more likely it is to be removed from the state. The initial value of the utility is u_0^i = 1. Due to (20) and (21), u_t^i ∈ [0, 1].

3.2. Landmark Removal
Assuming M_l is the maximum number of landmarks imposed by the user on the DBEKF, the landmark removal procedure is composed of three criteria (a code sketch of the utility update and of these criteria is given after the list):
1) Utility Threshold: when u_t^i reaches a value below T ∈ [0, 1] at iteration t, the landmark is discarded from the state;
2) Negative Inverse Depth: all features with negative inverse depth (e.g., due to a feature mismatch) are automatically discarded from the state;
3) Emergency Removal: if the number of matched landmarks is below a threshold m_l, the oldest landmarks are removed from the state so as to leave room for the new ones.
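The sketch below combines the utility update of equations (20)-(21) with the three removal criteria listed above. The landmark record fields and the number of oldest landmarks dropped by the emergency criterion are illustrative assumptions, since the paper does not specify them.

```python
def update_utility(u_prev, visible, detected, G=0.8):
    """Eqs. (20)-(21): exponentially discounted detection history; the utility
    only changes when the landmark was expected to be visible."""
    if not visible:
        return u_prev
    return G * u_prev + (1.0 - G) * (1.0 if detected else 0.0)

def landmarks_to_remove(landmarks, n_matched, T=0.01, m_l=10):
    """Apply the three removal criteria. Each landmark is assumed to be a dict
    with 'id', 'utility', 'inverse_depth' and 'age' fields (illustrative)."""
    doomed = {lm["id"] for lm in landmarks if lm["utility"] < T}            # 1)
    doomed |= {lm["id"] for lm in landmarks if lm["inverse_depth"] < 0.0}   # 2)
    if n_matched < m_l:                                                     # 3)
        # Assumption: drop just enough of the oldest landmarks to make room.
        oldest = sorted(landmarks, key=lambda lm: lm["age"], reverse=True)
        doomed |= {lm["id"] for lm in oldest[:m_l - n_matched]}
    return doomed
```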
4. Results
All experimental results presented in this paper come from two different datasets recorded with RAPOSA-NG. Each dataset is the result of a ROS2 log file recorded during operation. It contains odometry readings from the left track, the right track and the inclination arm position at 10 Hz each, angular velocity readings from the IMU at 30 Hz, rectified images from both cameras at 15 Hz, and all features retrieved from the image readings using the ORB feature detector at 15 Hz. Unless otherwise stated, all tests performed with the DBEKF use an upper bound of M_l = 60 landmarks in the state, a utility weight factor of G = 0.8, a utility threshold of T = 0.01 and a minimal number of matched landmarks per observation of m_l = 10. Results may vary slightly for the same dataset in different runs, since the log file is played in real time (as is the SLAM algorithm). The datasets are denoted here A and B, both illustrated in Figure 5. In dataset A, RAPOSA-NG performs a near-square trip of 3 × 3 meters in a soccer field full of newspaper pages, wooden planks and other sorts of objects, simulating debris. Dataset B comprises RAPOSA-NG climbing up and down a set of stairs with a height of 0.62 meters.

Fig. 5. Datasets A (top) and B (bottom). The indicated line represents the trajectory travelled by RAPOSA-NG during the experiment, in the direction pointed by the arrowhead.

By upper-bounding the number of landmarks in the state by a value M_l, the EKF computational complexity becomes upper bounded as well. If there are enough observations to grant M_l landmarks in the state for estimation at every iteration, the computational load should be constant over time. This is the case in all experiments presented in this paper. Figure 6 shows the time duration and the number of removed features per DBEKF iteration with both monocular and stereo observations from datasets "A" and "B", respectively. As expected, the computational complexity is near constant during both experiments, presenting some peaks and fluctuations due to new landmark initialization, the number of feature observations per update and other processing tasks unrelated to this software.

Fig. 6. Time duration (top) and number of removed features (bottom) per DBEKF iteration for datasets "A" (a) and "B" (b) with both monocular and stereo observations.
Fig. 7. Average time per iteration for different upper bounds in DBEKF for datasets "A" and "B".

Fig. 8. SLAM results using DBEKF with dataset A. The trajectories shown are: in blue/grey the camera trajectory using only odometry and the IMU, and in black the SLAM estimate. The covariance of the final position is also shown, as well as the final covariances of the landmarks in the state.

Fig. 9. SLAM results using DBEKF with dataset B. Refer to figure 8 for graphical notation.
Regarding dataset "A", one can notice some time intervals where a larger number of features is removed. These intervals happen when the robot finishes rotating 90 degrees and faces a new plane of observations, requiring space for new landmarks in the state. It then discards older landmarks in order to acquire new ones. Dataset "B" also presents some peaks regarding the number of feature observations, where the robot experienced drastic observation changes due to the rough movement of RAPOSA when finishing climbing up or starting to climb down the stairs. From the average time readings in both experiments, it is clear that the SLAM algorithm fully performs in real time. However, this does not take into account the time needed for feature acquisition using feature detectors such as SURF or ORB. Figure 7 shows the average time per iteration for different upper bounds in the DBEKF with both datasets. As expected, regardless of the dataset, the time per iteration rises with nearly cubic order. Although the computational load becomes constant, it is important to choose a reasonable upper bound, so as to avoid large time intervals for the EKF that can otherwise reveal linearity problems. Figures 8 and 9 present the trajectories estimated by the DBEKF for datasets A and B, in comparison with using odometry alone. Note that in both cases the algorithm was capable of correcting the errors induced by odometry, being able to return to the initial position of each dataset (note that no loop closure detection algorithm was employed). In Figure 10 a comparison between several configurations is made, in terms of the estimated uncertainty of the robot pose. This uncertainty is measured as the trace of the position estimation covariance matrix (the corresponding 3 × 3 submatrix of the state covariance). The configurations are: use of monocular features only, use of stereo features only, and use of all features. This result shows that the combination of mono and stereo features outperforms any other configuration.

Fig. 10. Evolution of the pose covariance from SLAM using DBEKF for datasets A (a) and B (b) along time. The different traces correspond to the use of mono features only, of stereo features only, and of all features.
5. Conclusions
Using both cameras as a stereo pair decreases the uncertainty of all landmarks and allows a better initialization of the SLAM algorithm, but the lack of stereo features may cause problems for SLAM if no other type of observation is used. In this work, both stereo and monocular features are used as observations, and as such one can use monocular information without worrying about the map scale problem (referred to in Section 1) as long as stereo observations are available. From the presented results, it is clear that using both monocular and stereo observations in the way introduced here increases the overall quality of SLAM over monocular-only or stereo-only observations. Although the Extended Kalman Filter (EKF) has been extensively used to solve the SLAM problem, its computational complexity grows unbounded with the number of landmarks. This paper showed that, with the DBEKF, one can achieve good estimations with constant complexity by removing landmarks from the state according to a utility evaluation criterion.
Notes
1 Assuming no a priori initialization of map scale.
2 http://www.ros.org

AUTHORS
Filipe Jesus∗ – Institute for Systems and Robotics, Instituto Superior Técnico, Av. Rovisco Pais, 1, Lisbon, Portugal, e-mail: filipejesus@ist.utl.pt.
Rodrigo Ventura – Institute for Systems and Robotics, Instituto Superior Técnico, Av. Rovisco Pais, 1, Lisbon, Portugal, e-mail: rodrigo.ventura@isr.ist.utl.pt.
∗Corresponding author

Acknowledgements
This work was supported by the FCT projects [PEst-OE/EEI/LA0009/2011] and [PTDC/EIACCO/113257/2009].

References
[1] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool, "Speeded-up robust features (SURF)", Comput. Vis. Image Underst., 110(3), 2008, pp. 346–359.
[2] J. Civera, A.J. Davison, and J. Montiel, "Inverse depth parametrization for monocular SLAM", IEEE Transactions on Robotics, 24(5), 2008, pp. 932–945.
[3] A.J. Davison, I.D. Reid, N.D. Molton, O. Stasse, "MonoSLAM: Real-Time Single Camera SLAM", IEEE Trans. Pattern Anal. Mach. Intell., 29(6), 2007, pp. 1052–1067.
[4] R.I. Hartley, A. Zisserman, "Multiple View Geometry in Computer Vision", Cambridge University Press, 2nd edition, 2004.
[5] J.J. Leonard, H.F. Durrant-Whyte, "Simultaneous map building and localization for an autonomous mobile robot". In: IEEE/RSJ International Workshop on Intelligent Robots and Systems, 1991, pp. 1442–1447.
[6] D. Oram, "Rectification for any epipolar geometry". In: BMVC'01, 2001, pp. 653–662.
[7] P. Pinies, T. Lupton, S. Sukkarieh, J.D. Tardos, "Inertial aiding of inverse depth SLAM using a monocular camera". In: IEEE International Conference on Robotics and Automation, 2007, pp. 2797–2802.
[8] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, "ORB: An efficient alternative to SIFT or SURF". In: International Conference on Computer Vision, Barcelona, 2011.
[9] N. Trawny, S.I. Roumeliotis, "Indirect Kalman filter for 3D attitude estimation", Technical Report, University of Minnesota, Dept. of Comp. Sci. & Eng., 2005.
[10] O.J. Woodman, "An introduction to inertial navigation", Technical Report UCAM-CL-TR-696, University of Cambridge, Computer Laboratory, 2007.
Absolute Localization for Low Capability Robots in Structured Environments Using Barcode Landmarks
Received 10th October 2012; accepted 22nd November 2012.
Duarte Dias, Rodrigo Ventura

Abstract: Research on multi-robot systems often demands the use of a large population of small, cheap, and low capability mobile robots. Many non-trivial behaviors demand these robots to be localized in real-time. This paper addresses the problem of absolute localization of such low capability robots using onboard sensors and local computation. The approach is based on the use of a pair of scan lines perceived by an onboard B&W camera to detect and decode artificial visual landmarks deployed along the environment. Each landmark consists of a dual-layer barcode which can encode its pose with respect to a global coordinate frame. Thus, the robot is not required to store a map of the landmark locations onboard. The method is based on an Extended Kalman Filter (EKF) fusing odometry readings with absolute pose estimates obtained from the camera. Experimental results using an e-puck robot with 8 KB of RAM and a 16 MIPS processor are presented, comparing the location estimates with both ground truth and odometry.

Keywords: vision based barcode detection, homography transformation, absolute robot localization, low capability robots
1. Introduction
There are several multi-robot systems which make use of numerous relatively simple robots for potential group-level benefits including scalability, flexibility, and robustness to individual failures. Swarm robotics is a good example of such a system [1, 2]. However, due to practical reasons (e.g., cost), each robot has reduced sensing, energy and computational capabilities. This makes it difficult to include the localization systems necessary for most practical, task-oriented applications, since these involve the movement of individuals and/or the transport of objects through precise goal points. This paper addresses this problem by proposing an onboard localization system for low capability robots. The localization system is based on an EKF, fusing movement predictions directly provided by the noisy measurements of the wheel movement with absolute pose estimates obtained from scanning barcode landmarks with a camera onboard the robot. Each barcode is uniquely identified by its embedded code. However, in the presence of critical memory space restrictions, the embedded code can also encode the landmark's absolute pose in the environment, thus preventing the usage of lookup tables. A simplified version of the standard barcode encoding methods is used to save computational effort. This work assumes structured environments based on corridors and vertical white walls. The robot used to test the developed
system is the e-puck educational robot from EPFL, chosen due to its extremely low computational (16 MIPS) and memory (8 KB) capabilities, and also the broad development support, since it has a large user base. Its reduced memory only allows for the extraction of 8 B&W lines of the camera image. More information about the e-puck features and architecture can be found in [3]. Some localization systems have been developed for the e-puck. In [4] a decentralized localization system is described, using an external mobile component with high computational capability and performing autonomous localization, to compute the position in the environment of each e-puck. In [5], another decentralized system is proposed, in which an external component receives the sensor data from the e-puck camera and performs landmark detection and localization processing with particle filters. The landmarks are colored, which demands more computational and memory capabilities. In [6], a centralized system was proposed, also based on particle filters for localization, using gray images to extract bearing information from black landmarks in the scenario. This system is lighter in terms of landmark detection, but it allows only for relative positioning of the robot with respect to the landmarks, requiring extra model matching techniques [7] to obtain the absolute position of the robot. The first two systems involve external components to compensate for computational limitations, which can compromise navigation decisions through communication overhead, and are sometimes impractical due to communication limitations. The last two systems require a lookup table for the landmark positions, which compromises the scalability of the number of landmarks used, bounding the environment size with respect to the landmark density necessary to achieve the desired localization performance. This factor is of extreme importance here given the low memory space of the robot used. Several algorithms that can perform barcode detection already exist for simple portable 2D cameras. The detection process can be divided into three parts: 1) Barcode localization in the image: most methods extract regions where the barcode is most likely to exist, called regions of interest (ROI), using gradient methods [8–11] and wavelet transforms [12] to find regions with a high unidirectional derivative density caused by the barcode stripes, or binarization methods to find dense black and white regions [13]. 2) Obtaining barcode geometric parameters such as the lengths of the projected stripes in the camera, and the geometric distortions induced by the perspective transformation inherent to the observation point: in [9, 12, 13] single scan lines are used to extract the stripe length, and in [8, 10] a perspective transformation estimation, intrinsically
containing the stripe length and the geometric distortions, is performed. 3) Decoding the embedded code, directly using the scan lines and the computed geometric parameters to obtain the barcode stripes, and using the barcode encoding to decode them. The described methodologies use the whole image and introduce great computational complexity, in terms of barcode localization, through image processing. This paper introduces a simplified 1D version of the above methods, limited to the 8 available B&W camera lines and based on additional restrictions for the barcode positioning in the environment.
2. Barcode Localization
Fig. 2. Geometry of the problem: a double layered barcode scan with the chosen scan lines for each layer and the respective projections in the barcode surface (in red/darker).
2.1. Barcode Design Framework
To maximize the number of encoded bits, the barcode type used, illustrated in Fig. 1, is composed of two layers of N_stripes geometrically identical white (spaces, or '0's) and black (bars, or '1's) stripes of width X_dim and height h/2. It is composed of three major areas: 1) the embedded code, which contains the unique barcode sequence, and can also encode the binary representation of the barcode's pose in the environment, (x_w^b, y_w^b, θ_w^b); 2) guard sets, included at each barcode horizontal edge, with N_guard bar-space pairs; 3) quiet zones, composed of a set of spaces placed around the barcode to prevent detectors from confusing the barcode's signal with the rest of the signal. Fig. 2 depicts the geometry of the problem. A 3D frame associated with each barcode is placed at the barcode's center, thus defining its pose in the environment. The z axis defines the direction from the lower layer to the upper layer.
Fig. 3. The Ground projection of a double barcode scan, seen through the camera frame.
Fig. 1. Barcode major areas and geometric parameters.

The camera, modeled here by the pinhole model, with a focal length f and a principal point of coordinates (o_x, o_y), uses two horizontal scan lines, equally distant from the optical axis, for barcode detection. The scan lines are taken from the top and bottom of the captured image, which is Q_x pixels wide and Q_y pixels high, producing, respectively, the horizontal and vertical fields of view, named FOV_x and FOV_y. The vertical center of the barcode must be aligned with the camera's height to ensure each scan line is aligned with its layer. Fig. 2 and Fig. 3 illustrate a barcode observation situation and its respective projection onto the ground plane. The intersections of the optical axis, r_o, and its orthogonal axis, r_p, with the y axis of the barcode are named, respectively, B_2 and B_1. From Fig. 3, it can be concluded that a point on the barcode surface, b, relates to the respective projected point in the camera plane, ∆, by the following:

$$ \frac{o_x - \Delta}{f} = \frac{\cos(\theta)\,(b - B_2)}{\sin(\theta)\,(B_1 - b)} = m\, \frac{B_2 - b}{B_1 - b} \qquad (1) $$

where θ is the inclination of the camera to the barcode and m is the slope of r_p, defined as −(tan(θ))^{-1}. Relevant θ's are contained in the [−π/2, π/2] interval, but due to the symmetry of the problem one only needs to consider the [0, π/2] part to analyze the problem for the whole interval. Considering the points associated with the stripe transitions of the barcode, respectively b_k and ∆_k, the width of each barcode stripe projected in the camera, L_k, can be defined as |∆_{k−1} − ∆_k|. Using (1), this width can be expressed as follows:

$$ L_k = f\, X_{dim}\, m\, \frac{B_1 - B_2}{(B_1 - b_{k-1})(B_1 - b_k)}, \qquad (2) $$
where b_k = (0.5 N_stripes − k) X_dim. The minimum observation boundary which allows barcode detection, for each θ, is described by the geometrical locus from where at least one barcode edge coincides with the limits of the camera's FOV_x, defined by two lines with the following expressions:

$$ y = b_0 + \tan\!\left(\theta - \frac{FOV_x}{2}\right) x \qquad (3) $$

$$ y = b_{N_{stripes}} + \tan\!\left(\theta + \frac{FOV_x}{2}\right) x \qquad (4) $$

The maximum observation boundary, for each θ, is described by the geometrical locus from where the camera observes the barcode with at least one projected stripe at the minimum length that still allows barcode detection, L_min. Replacing, in (2), L_k with L_min, and B_1 and B_2 with their respective expressions as functions of x and y (defined by the line expressions of r_p and r_o, respectively), one can describe this geometrical locus as a hyperbola:

$$ x^2 A_{xx} + y^2 A_{yy} + 2xy A_{xy} + 2x B_x + 2y B_y + C = 0 \qquad (5) $$

$$ A_{xx} = 1 \qquad A_{yy} = \tan^2(\theta) \qquad A_{xy} = \tan(\theta) $$
$$ B_x = \frac{X_{dim}}{2}(N_{stripes} - 1)\tan(\theta) - \frac{f}{C_{min}\cos^2(\theta)} $$
$$ B_y = \frac{X_{dim}}{2}(N_{stripes} - 1)\tan(\theta) $$
$$ C = -\frac{X_{dim}^2 (N_{stripes} - 2)\, N_{stripes} \tan^2(\theta)}{4} $$

The Nyquist theorem for digital signal recovery states that the receiver cannot recover signals with frequencies higher than one half of the sampling frequency. Since the camera samples the signal with a period of 1 pixel, the lowest period allowed for the signal is 2 pixels, which means L_min is 1 pixel. The detection area, for each θ, is found between the minimum and maximum observation boundaries, defined respectively by (3) and (4), and by (5). Also, the scan line inclination causes, from a certain distance d0_max, illustrated in Fig. 4, parts of the barcode not to be captured, thus precluding detection. The respective distance to the barcode center can be expressed as:

$$ d = d0_{max} - 0.5\, w \sin(\theta) = h \frac{f}{Q_y} - 0.5\, w \sin(\theta) \qquad (6) $$
Expressions (3) to (6) constrain the following design variables: the number of stripes (N_inf), their size (X_dim and h), the camera's focal length (f) and its image size (Q_x and Q_y). They provide a framework which allows the choice of the barcode and camera parameters according to the detection area requirements.

Fig. 4. Side view of the detection problem, from a plane orthogonal to the ground, parallel to the camera's optical axis and passing through the barcode's center.

2.2. Barcode Detection Algorithm
Each barcode layer is independently detected by its respective scan line. Fig. 5 represents a typical grayscale image retrieved by the camera. The detection process for each scan line can be described as follows:
1) Binarize the signal into black and white regions;
2) Generate barcode hypotheses;
3) Obtain the embedded codes from each hypothesis;
4) Extract the most likely camera observation pose in the barcode frame, for each hypothesis.
The detections on both scan lines whose horizontal positions differ by less than a few pixels are merged together to form a complete barcode. Next, the embedded code is extracted and the binary sequence obtained by inversely applying the method used for encoding; in this case, the NRZ method is used. A Cyclic Redundancy Check (CRC) sequence at the end of the code is used to prevent bit detection errors.
Fig. 5. Typical barcode structure on a captured image.

The binarization process is performed using a threshold computed from the mean of the maximum and minimum pixel intensities of the scan line. To filter out regions which are too close to the computed threshold, a distance between that threshold and the intensity of each region is used. Barcode hypotheses are formed by extracting ROIs from the binarized signal. Black regions around a seed point, consisting of the smallest black region, are connected until the predicted barcode size, computed from the width of the seed and the barcode specifications (N_stripes and quiet zones), is reached. Several ROIs can be extracted by repeating the process on the remaining regions of the scan line. The method has the problem of not considering the geometric distortions in the ROI length computation, which requires the barcodes to be a certain distance apart from each other; otherwise more than one barcode can be caught inside the same ROI. Since the walls of the scenario are assumed white, an extracted ROI must be related to a barcode. Also, for each hypothesis, a perspective transformation, H, translating points in the camera plane, ∆, to points on the barcode surface, b, is computed, using the transitions of the hypothesis's guard sets, present at each ROI edge. The homography concept is used to define the transformation, since it provides a linear system, resulting in fast estimations. The former points in homogeneous coordinates are, respectively, ∆_h = (∆ 1)^T and b_h = (b 1)^T.
From Fig. 3, one can conclude the following:

$$ A = \begin{pmatrix} \alpha_0^T \\ \alpha_1^T \end{pmatrix}, \qquad B = \begin{pmatrix} \beta_0^T \\ \beta_1^T \end{pmatrix}, \qquad \Delta_h = \lambda A^{-1} B\, b_h \qquad (7) $$

where α_0 and α_1, β_0 and β_1, are the vectors describing the position and orientation of, respectively, the camera plane and the barcode frame, with respect to the camera's focal point, which defines the camera frame. α_1 and β_1 are defined as unit vectors. Thus, A and B are 2x2 matrices. A^{-1}B is encapsulated in a 2x2 homography matrix H. The embedded code for each hypothesis is obtained by sampling the respective ROI according to a grid of stripe middle points, which is built using the estimated perspective transformation, as shown in Fig. 6. The sampling is done sequentially from the borders to the center of the ROI while refining the perspective transformation, H, with each new transition found.
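The binarization step described earlier in this subsection is simple enough to sketch directly: the threshold is the mean of the extreme intensities of the scan line. The margin used to flag regions that lie too close to the threshold is an assumed parameter, since its value is not given in the text.

```python
import numpy as np

def binarize_scan_line(scan, margin=10):
    """Threshold = mean of the max and min intensities of the scan line
    (Sec. 2.2); pixels closer to the threshold than `margin` are flagged
    as unreliable (assumed filtering rule)."""
    scan = np.asarray(scan, dtype=float)
    threshold = 0.5 * (scan.max() + scan.min())
    bars = scan < threshold               # '1' = bar (dark), '0' = space (bright)
    reliable = np.abs(scan - threshold) >= margin
    return bars, reliable
```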
Fig. 6. Projected middle stripe points and barcode transitions, predicted from the estimated transformation during the detection process.

The camera pose in the barcode frame, (x_b^c, y_b^c, θ_b^c), is obtained from a simple homography decomposition. Observing Fig. 3, one can easily build matrix A in (7) using the camera parameters. From A and the estimated H, B can be computed up to a scale factor, which is necessary to make β_1 a unit vector. From Fig. 3, one can see that β_0 is directly the barcode position in the camera frame, and β_1 defines its orientation in the same frame, which is the opposite of the camera orientation in the barcode frame, θ_b^c. The expression for that angle is as follows:

$$ -\theta_b^c = \pi + \operatorname{sgn}\!\left(\frac{\beta_{1x}}{\beta_{1y}}\right) \arccos\!\left( \frac{1}{\sqrt{1 + \left(\frac{\beta_{1x}}{\beta_{1y}}\right)^2}} \right) \qquad (8) $$

The camera position in the barcode frame is obtained by applying the inverse of the barcode axis transformation matrix to the origin of the camera frame, as follows:

$$ \begin{pmatrix} x_b^c \\ y_b^c \\ 1 \end{pmatrix} = \begin{pmatrix} \cos(\theta_b^c) & \sin(\theta_b^c) & \beta_{0x} \\ -\sin(\theta_b^c) & \cos(\theta_b^c) & \beta_{0y} \\ 0 & 0 & 1 \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \qquad (9) $$

3. Robot Localization System
3.1. Observation Model
The robot pose in the environment, (x_k, y_k, θ_k), is estimated onboard the robot. Periodic movement information, obtained directly from the stepper motor pulses, is used in the predict step of the EKF algorithm, applying a linear motion model. Barcode measurements, acquired by a barcode sensor from applying the above described method to the captured images, are used in the update step of the EKF algorithm.
3.2. Observation Model
Observing Fig. 7, (x_k, y_k, θ_k) can be obtained from the robot pose in the barcode frame, (x_b^r, y_b^r, θ_b^r), and the barcode absolute pose in the environment, (x_w^b, y_w^b, θ_w^b), using the following transformation:

$$ \begin{pmatrix} x_k \\ y_k \\ \theta_k \end{pmatrix} = \begin{pmatrix} \cos(\theta_w^b) & -\sin(\theta_w^b) & 0 \\ \sin(\theta_w^b) & \cos(\theta_w^b) & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_b^r \\ y_b^r \\ \theta_b^r \end{pmatrix} + \begin{pmatrix} x_w^b \\ y_w^b \\ \theta_w^b \end{pmatrix} \qquad (10) $$

The pose (x_w^b, y_w^b, θ_w^b) can be directly decoded from the embedded code, or obtained from a lookup table using the embedded code as the landmark identifier. The pose (x_b^r, y_b^r, θ_b^r) is obtained by relating (x_b^c, y_b^c, θ_b^c), computed for the respective barcode detection, with the camera's pose in the robot frame. Expression (10) is used as the observation model. An empirical model for the noise associated with the extraction of (x_b^c, y_b^c, θ_b^c) is also added. This model is discussed in the results section.
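A short sketch of the composition in equation (10) follows, turning the robot pose in the barcode frame plus the barcode pose decoded from the embedded code into the absolute robot pose; wrapping the heading with an atan2 trick is an implementation detail not specified in the paper.

```python
import numpy as np

def robot_pose_in_world(pose_rb, pose_bw):
    """Eq. (10): pose_rb = (x, y, theta) of the robot in the barcode frame,
    pose_bw = (x, y, theta) of the barcode in the world frame."""
    xr, yr, tr = pose_rb
    xb, yb, tb = pose_bw
    c, s = np.cos(tb), np.sin(tb)
    xw = c * xr - s * yr + xb
    yw = s * xr + c * yr + yb
    tw = np.arctan2(np.sin(tr + tb), np.cos(tr + tb))  # wrap heading to (-pi, pi]
    return xw, yw, tw
```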
Fig. 7. Relations between world frame, barcode frame and robot frame.

3.3. Onboard Implementation
The robot pose in the environment, (x_k, y_k, θ_k), is estimated onboard the robot. Movement information, obtained directly from the stepper motor pulses, is used in the predict step of the EKF algorithm, applying a linear motion model. Barcode measurements are acquired by applying the above described method to the captured images, and are used in the update step of the EKF algorithm. The localization system is divided into three main phases: 1) barcode detection for every captured image, which is always running, 2) periodic extraction of movement information relative to the last extraction, assigning a time stamp referring to the extraction time, and 3) pose estimation, applying EKF predict and update steps with the odometry information and
the barcode observations obtained during the last image processing, ordered according to their time stamps. The barcode observations have their time stamps set to the middle of their image processing period. Fig. 8 shows the time diagram for the three phases.
Fig. 8. Time diagram for the major processing phases of the localization system.

Fig. 9. Barcode detection rate with distance: a) Frontal perspective situation using 4 random barcodes. b) Several inclination situations for a random barcode, with the theoretical maximum observation distances (black dots) for each considered inclination.
4. Experimental Results
The accuracy of the barcode design framework and the performance of the barcode detection algorithm were evaluated on a dataset with 30 barcode visualizations for each distance and inclination pair (d, θ), with d from 10 cm to 60 cm in 5 cm steps, and θ from 0° to 60° in 15° steps. The barcodes used have 12 stripes in each layer for the embedded code, 2 bar-space pairs for each guard set, and each stripe has 1.5 mm of width and 1.25 cm of height. The embedded code reserves 24 bits to store (x_w^b, y_w^b, θ_w^b) and the last 8 bits to store the CRC error detection bits. The camera used has a focal length of 80 vertical pixels and 640 horizontal pixels, and the image frame is 480 pixels wide and 8 pixels high. Two cases were considered here. The first, shown in Fig. 9 a), consists of frontal barcode detections (θ = 0°), to test the robustness of the detection range to barcode variations. Four random barcodes were used and, for all of them, the respective detection rates drop sharply at 50 cm, which is the maximum distance derived from the scan line inclination, computable from expression (6) for θ = 0°. Only 3 false detections were found in the 1320 detections. The second case, shown in Fig. 9 b), consists of barcode detections for the several θ values, using just one random barcode. The predicted detection ranges, computable from the barcode design framework (equation (5)) and represented in Fig. 9 b), were matched against the experimental ranges. One can observe that, for the first two θ values, the detection ranges are below their predicted values. This happens because the respective ranges are higher than the maximum distance derived from the scan line inclinations, which is about 50 cm. For the other θ values, the predicted ranges are consistent with the results. An L_min of 1.5 pixels, instead of the 1 pixel discussed earlier, was used since it better explains the results. For the latter case, the accuracy of the estimation of the relative pose of the camera in the barcode frame was analysed
by performing a statistical analysis on the computed relative pose estimates from all the experiments, shown in Fig. 10. From this data, an empirical model, based on the 3x3 covariance matrix representing the noise when extracting (x_b^c, y_b^c, θ_b^c), can be built. The coordinates (x_b^c, y_b^c) are assumed independent from θ_b^c. The data show that the uncertainty in (x_b^c, y_b^c) has an exponential behavior with the distance to the barcode, d, and follows the inclination to the barcode, θ. The empirical regressions, obtained by least squares, for both eigenvalues and the rotation angle of the 2x2 matrix representing that uncertainty, are as follows:

$$ \lambda_1(d) = e^{25.70\,d - 15.46} \qquad \lambda_2(d) = e^{16.15\,d - 17.27} \qquad (11) $$

$$ \operatorname{rotation}(\theta) = 1.91\,\theta - 1.43\, \operatorname{sgn}(\theta) \qquad (12) $$

where θ can be replaced with θ_b^c + π, and d with √((x_b^c)² + (y_b^c)²).
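The regressions in equations (11)-(12) can be assembled into the 2x2 position covariance used in the EKF update by rebuilding the matrix from its eigenvalues and the fitted rotation angle. The sketch below shows that reconstruction; since only the eigen-decomposition is reported, the exact way the matrix is assembled here is an assumption.

```python
import numpy as np

def barcode_position_covariance(x_c, y_c, theta_c):
    """Empirical 2x2 covariance of (x_b^c, y_b^c) from eqs. (11)-(12), built
    from the fitted eigenvalues and rotation angle."""
    d = np.hypot(x_c, y_c)          # distance to the barcode
    theta = theta_c + np.pi         # inclination, as stated in the text
    lam1 = np.exp(25.70 * d - 15.46)
    lam2 = np.exp(16.15 * d - 17.27)
    rot = 1.91 * theta - 1.43 * np.sign(theta)
    R = np.array([[np.cos(rot), -np.sin(rot)],
                  [np.sin(rot),  np.cos(rot)]])
    return R @ np.diag([lam1, lam2]) @ R.T
```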
Fig. 10. Statistics for the extraction of the camera's pose in the barcode frame, for a random barcode.

The localization system was evaluated using the scenario defined by Fig. 11. Using a joystick, the robot performs several laps around the scenario. The pose was estimated in real time and transmitted to an external computer. Odometry measurements were saved at periods of 100 ms and the barcode sensor took about 400 ms to process each image for barcodes (300 ms for image capture and 100 ms for image processing). A marker is placed on top of the robot, rigidly attached to the robot frame, and is tracked by an external camera, using the ARToolKit Toolbox1, to provide a groundtruth for the robot pose in the
environment. A calibration to transform the 3D poses provided by the ARToolKit into 2D poses, for comparisons with the groundtruth, was also performed. Fig. 11 shows the results of the algorithm in comparison with odometry alone, for the whole path of an experiment in which the robot is supposed to follow a triangular route. It is observed that, while the algorithm's estimates keep the robot localized on the triangular route, the odometry-based estimate progressively diverges from the route.
Fig. 11. Path estimation with the developed localization system (green/lighter) against the odometry alone (red/darker), for an experiment in which a triangular route is performed.
Fig. 12. Estimation error quantification of the algorithm along the robot’s path. a) Position error. b) Orientation error.
Figs. 12 and 13 show the Euclidean distance and the orientation errors between the algorithm’s and the groundtruth pose estimations, throughout the whole path of an experiment. The last figure corresponds to a kidnapping situation. The marked points correspond to instants where the algorithm used barcode information for pose update. In normal conditions (Fig. 12) the localization error is shown to be smaller than the robot radius. In the kidnapping situation, the results show that the algorithm can recover from wrong estimations provided very few barcode observations (in this case two to four). One can also observe that there is normally a high error decrease when pose correction events occur due to barcode observations, which shows the high accuracy of the barcode related measurements.
5. Conclusions
In this paper we described a localization system for use onboard low-capability robots (8 KB of memory and a 16 MIPS processor). The proposed system was implemented on the e-puck and shown to converge with an average error of only about 1.5 cm, for the barcode, robot and scenario specifications used in the experiments.
Further experiments are foreseen using an autonomous navigation controller on the robot, relying on the localization results provided by the developed system. Future developments will target common situations where only a portion of the barcode is inside the image, which prevents barcode observations and can possibly lead the robot off the navigation track. A study of the optimal positioning of the barcodes in the environment will also be considered, using the developed mathematical framework. A new scenario will be used to reproduce an example of a real case where this localization system should be deployed.

Fig. 13. Estimation error quantification for a kidnapping situation. The robot starts with a wrong pose estimate, and in the middle of the experiment a kidnap is also applied. a) Position error. b) Orientation error.
Notes
¹ URL: http://www.hitl.washington.edu/artoolkit/ (retrieved on 1 January 2013).
AUTHORS
Duarte Dias∗ – Institute for Systems and Robotics, Instituto Superior Técnico, TULisbon, Portugal, e-mail: duarte.dias@ist.utl.pt
Rodrigo Ventura – Institute for Systems and Robotics, Instituto Superior Técnico, TULisbon, Portugal, e-mail: rodrigo.ventura@isr.ist.utl.pt
∗ Corresponding author
Acknowledgements This work was supported by the FCT project [PEstOE/EEI/LA0009/2011].
References
[1] J. N. Pereira, A. Christensen, P. Silva, and P. Lima, "Coordination through institutional roles in robot collectives". In: Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, van der Hoek, Kaminka, Lespérance, Luck and Sen (eds.), May 2010, pp. 1507–1508.
[2] L. Bayindir and E. Sahin, "A review of studies in swarm robotics", Turkish Journal of Electrical Engineering and Computer Sciences, vol. 15, no. 2, 2007, pp. 115–147.
[3] F. Mondada, M. Bonani, X. Raemy, J. Pugh, C. Cianci, A. Klaptocz, S. Magnenat, J. C. Zufferey, D. Floreano, and A. Martinoli, "The e-puck, a robot designed for education in engineering". In: Proceedings of the 9th Conference on Autonomous Robot Systems and Competitions, May 2009, pp. 59–65.
[4] M. Nasser, "On E-Puck Mobile Robots for Distributed Robotics", ser. National University of Singapore.
[5] S. Slusny, R. Neruda, and P. Vidnerová, "Localization with a low-cost robot". In: Proceedings of the Conference on Theory and Practice on Information Technologies, CEUR Workshop Proceedings, September 2009, pp. 77–80.
[6] S. Grzonka, W. Burgard, C. Stachniss, and G. Grisetti, "Introduction to mobile robotics: Localization with a low cost robot", 2008, University of Freiburg.
[7] R. Negenborn, "Robot Localization and Kalman Filters: On finding your position in a noisy world", ser. Utrecht University, Utrecht, Netherlands, September 2003.
[8] A. Burian, M. Vehvilainen, J. Kangas, "Camera barcode reader with automatic localization, detection of orientation and type classification", Computers and Simulation in Modern Science, vol. 1, 2008, pp. 214–219.
[9] D. Chai, F. Hock, "Locating and decoding EAN-13 barcodes from images captured by digital cameras". In: 5th International Conference on Information, Communications and Signal Processing, December 2005, pp. 1595–1599.
[10] J. C. Rocholl, S. Klenk, and G. Heidemann, "Robust 1D barcode recognition on mobile devices". In:
20th International Conference on Pattern Recognition, August 2010, pp. 2712–2715.
[11] L. Fang and C. Xie, "1-D barcode localization in complex background". In: 2nd International Conference on Computational Intelligence and Software Engineering, December 2010.
[12] K. Wang, Y. Zou, and H. Wang, "1D bar code reading on camera phones", International Journal of Image and Graphics, vol. 7, no. 3, 2007, pp. 529–550.
[13] S. Wachenfeld, S. Terlunen, X. Jiang, "Robust recognition of 1-D barcodes using camera phones". In: 19th International Conference on Pattern Recognition, December 2008.
Toward Rich Geometric Map for SLAM: Online Detection of Planes in 2D LIDAR
Received 10th October 2012; accepted 22nd November 2012.
Cyrille Berger

Abstract: Rich geometric models of the environment are needed for robots to carry out their missions. However, a robot operating in a large environment requires a compact representation. In this article, we present a method that relies on the idea that a plane appears as a line segment in a 2D scan, and that by tracking those lines frame after frame it is possible to estimate the parameters of that plane. The method is divided into three steps: fitting line segments on the points of the 2D scan, tracking those line segments in consecutive scans, and estimating the plane parameters with a graph-based SLAM (Simultaneous Localisation And Mapping) algorithm.

Keywords: laser scanner, SLAM, LIDAR
1. Introduction
1.1. Environment Modelling
Accurate and rich environment models are essential for a robot to carry out its mission. However, those models are not always available to the robot, and when they are available (through a geographic information database) they might be inaccurate or outdated. This happens, for instance, after a catastrophe in which part of the environment has been destroyed, but it also happens continuously, as new buildings are constructed or demolished. This raises the need for a robot to be able to construct an environment model using its own sensors; at the same time, this model can be used to improve the knowledge of the robot's position. This is the problem known as Simultaneous Localisation and Mapping (SLAM) [1, 5]. However, most research on SLAM has focused on improving the robot's localisation by tracking features in the environment, and for that purpose a sparse model of the environment (such as interest points) is generally sufficient. On the other hand, many techniques used for 3D reconstruction assume that a very accurate localisation is available (generally through GPS-like systems), which is not necessarily practical for a real-time robotic system, and they also work offline. This is why, in our research, we are interested in using SLAM techniques to construct rich and dense models of the environment; this has the advantage of improving both the localisation of the robot and the quality of the model, as well as creating the model in real time.
To create a model of the environment, a robot can use different sensors: the most popular are cameras [12], LIDAR [18], RADAR [8], or combinations of multiple sensors [7]. Cameras are very cost effective and provide the most information, since it is possible
to recover structure and appearance; however, they are less accurate than LIDAR and RADAR, and they are not very effective at recovering the structure from a distant viewpoint, which is especially important when used on board an unmanned aircraft.
1.2. Using a 2D LIDAR for 3D modelling
Most approaches that use LIDAR start by generating a cloud of points, combined with a shape registration technique [19] to correct the uncertainty in the localisation of the robot, and this usually results in a very accurate model. While a cloud of points is a dense model of the environment, the amount of information it contains makes it difficult to use in practice. It needs to be transformed into a different representation that abstracts the model; for instance, point clouds are commonly used to create occupancy grids [17], used for collision detection. Also, a cloud of points, in itself, contains very little information on the actual structure of the environment. One solution used to recover the structure has been to combine the information coming from the cloud of points with images [13]. Another solution is to transform the cloud of points into higher-level geometric information. Classification processes like the Hough transform [9, 16] can be used to achieve that goal, making it possible to convert the LIDAR data into high-level geometric features such as cubes, cylinders or spheres [18]. The use of higher-level geometric information not only allows the creation of more compact models, which reduces the amount of memory and computing power necessary to use them, but also enables other applications such as object recognition and the matching of information coming from different sensors and from existing schematics [2].
However, using a cloud of points to generate a high-level geometric representation requires waiting until the robot has finished acquiring the data before the model can be constructed, whereas we are interested in an online and incremental approach. As shown in [6] and Figure 1, in a 2D scan of a LIDAR sensor a plane appears as a line segment, which makes it possible to extract planes in the environment directly from those scans: by simply tracking the line segment, scan after scan, the aggregation of those scans allows the parameters of the plane to be recovered. For such a system to work, it is essential that the plane of the 2D scans is not parallel to the displacement of the robot; otherwise, the rays of the LIDAR will always hit the plane at the same place, which would not provide any information on the 3D structure.
In this article, we present a method for extracting line segments from the 2D scan, tracking them between two
Fig. 1. The left picture shows a 3D view of a scan of two buildings, while the right figure shows how it appears in 2D.
Fig. 2. This figure shows an unlikely example where the assumption that the tracked segment belongs to the same plane is wrong. The robot is flying exactly perpendicularly to the house, meaning that the observations on the two sides of the roof appear parallel. The figure on the top left shows the model of the house (the red lines are scan rays); on the top right the robot has scanned only half of the house, and the resulting plane is still correct; on the bottom right, the robot has finished flying over the house (the black lines illustrate the constraints). The bottom left image shows that if the house has a 2° angle with the sensor then the roof is correctly detected as two planes.
frames, and then the line segment is treated as an observation of the plane within the graph-based SLAM framework, using a weak constraints network optimiser (WCNO) [3]. Our main contributions are the error and observation models used for WCNO, as well as the overall combination of algorithms to extract planes from 2D scans. The first section describes the line extraction, which is a classical split-and-merge [10], and the line tracking. The second section explains the error and observation models used in WCNO. The last section provides simulation and real data results.
2. Detecting planes in the LIDAR data
As mentioned in the introduction, when a 2D LIDAR scan hits a plane, the points from the plane appear as a line segment in the 2D scan. Our assumption is that whenever there is a line segment in the 2D scan, it does belong to a plane, especially if it is tracked between two
consecutive frames. However, as shown in Figure 2, there are corner cases where this assumption that two consecutive segments belong to the same plane does not hold. This happens if the robot is precisely on a trajectory that is perpendicular to the intersection line between two planes, or perpendicular to a cylinder. However, we argue that those cases are unlikely to happen in real life, since they would require the robot to be precisely on that trajectory: a slight movement off that trajectory and the algorithm will detect a break in the line segment, as shown on the bottom left image of Figure 2. Furthermore, using the graph optimisation technique it is possible to detect those corner cases, since they will fail to optimise.
The algorithm to detect planes works in two steps: first the scans of the LIDAR data are converted into line segments, and then those line segments are tracked against the segments of the previous scan. Those line segments are then used as observations of a plane.
2.1. Extracting line segments
The best method to transform a set of points from a LIDAR scan into line segments is the split-and-merge algorithm [10]. The parameters of each line segment are then optimised with a weighted line fitting algorithm [11] to provide a better fit to the points.
2.2. Tracking line segments
Segments are tracked between two consecutive frames by comparing the distance to the origin (ρ) and the angle of the line (θ), and by checking whether the segments overlap (see Figure 3). Given S_t = {s_0^t, ..., s_m^t}, the set of segments at time t, and S_{t−1} = {s_0^{t−1}, ..., s_n^{t−1}}, the set at the previous time t − 1, the goal of the tracking algorithm is to find (k, l) such that s_k^t ∈ S_t is collinear and overlapping with s_l^{t−1} ∈ S_{t−1}.
Given two line segments a and b of parameters (ρ_a, θ_a) and (ρ_b, θ_b), they are considered collinear if and only if:

cos(θ_a − θ_b) > cos(θ_max)   (1)
|ρ_a − ρ_b| < ρ_max   (2)

Furthermore, if two collinear line segments a and b have the extremities (θ_{a,1}, θ_{a,2}) and (θ_{b,1}, θ_{b,2}), then the following conditions guarantee that they are overlapping:

θ_{a,1} ≤ θ_{b,1} ≤ θ_{a,2} or θ_{a,1} ≤ θ_{b,2} ≤ θ_{a,2} or θ_{b,1} ≤ θ_{a,1} ≤ θ_{b,2} or θ_{b,1} ≤ θ_{a,2} ≤ θ_{b,2}   (3)

Fig. 3. This figure shows the angles used for comparing two segments during tracking. ρ_x and θ_x are the parameters of a line, while θ_{x,1} and θ_{x,2} are the angles of the extremities of the segment.
Given a segment s_k^t ∈ S_t, S_{t−1}(s_k^t) is the subset of segments of S_{t−1} that satisfy the collinearity conditions (equations 1 and 2) and the overlapping condition (equation 3). If S_{t−1}(s_k^t) = ∅, then s_k^t corresponds to a new plane; otherwise, it is necessary to select a segment s_l^{t−1} ∈ S_{t−1} that will be the tracked segment for s_k^t, and that will be used as the new observation of a plane. s_l^{t−1} is defined such that:

∀ s_i^{t−1} ∈ S_{t−1}(s_k^t):  |ρ_k^t − ρ_l^{t−1}| ≤ |ρ_k^t − ρ_i^{t−1}| < ρ_max   (4)

Using a prediction of the parameters of the segment, based on the motion of the robot (using the SLAM algorithm) or on the evolution of the parameters of the tracked segment, improves the results of the tracker; it also limits the problem arising when the assumption that a line is a plane is wrong (see Figure 2).
2.3. Detecting and tracking planes with a 2D LIDAR sensor
The algorithm for detecting and tracking planes in consecutive scans from a 2D LIDAR sensor follows these steps (a sketch is given below):
1) at time t, extract the segments S_t = {s_0^t, ..., s_m^t} in the scan,
2) for each segment s_k^t ∈ S_t, find the segment s_l^{t−1} ∈ S_{t−1}(s_k^t) that fulfils equation 4:
– if there is no such segment s_l^{t−1}, then s_k^t is the beginning of a new plane,
– if there is a segment s_l^{t−1}, then s_l^{t−1} is used as a new observation of the plane associated with s_k^t,
3) for each segment s_l^{t−1} ∈ S_{t−1} which has not been selected as a tracked segment for any s_k^t ∈ S_t, check if the associated plane has enough observations; otherwise, remove it from the map.
The next section explains how the parameters of a plane are estimated.
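The sketch below illustrates the tracking test of conditions (1)–(4) for a single segment. The dictionary representation of a segment (ρ, θ and the extremity angles) and the threshold names are illustrative assumptions.

```python
import math

def collinear(a, b, theta_max, rho_max):
    """Conditions (1)-(2): lines (rho, theta) close in angle and in distance."""
    return (math.cos(a['theta'] - b['theta']) > math.cos(theta_max)
            and abs(a['rho'] - b['rho']) < rho_max)

def overlapping(a, b):
    """Condition (3): the angular extents of the two segments overlap."""
    return (a['t1'] <= b['t1'] <= a['t2'] or a['t1'] <= b['t2'] <= a['t2']
            or b['t1'] <= a['t1'] <= b['t2'] or b['t1'] <= a['t2'] <= b['t2'])

def track(segment, previous_segments, theta_max, rho_max):
    """Return the previous segment tracked by `segment` (eq. (4)), or None
    if the segment starts a new plane."""
    candidates = [s for s in previous_segments
                  if collinear(segment, s, theta_max, rho_max)
                  and overlapping(segment, s)]
    if not candidates:
        return None
    # the candidate whose distance parameter is closest to the new segment
    return min(candidates, key=lambda s: abs(segment['rho'] - s['rho']))
```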
3. Optimisation model
The estimation of the parameters of the planes and of the position of the robot is done using a weak constraints network optimiser (WCNO), as described in [3]. This algorithm works by optimising a graph of constraints, where the nodes are objects (robot poses and landmarks) and the edges are the observations (coming from various sensors: odometry, LIDAR, ...). The key feature of WCNO is that it allows the use of partial observations, which is interesting for estimating the plane parameters as observed with a 2D LIDAR, since in such a case only a few parameters of the plane are observed at each time. WCNO relies on steepest gradient descent and requires the definition of an object model and of a constraint model. Three functions are needed for the object model:
– initialisation: this function uses one or several constraints to compute a first estimate of the parameters of an object, as well as the associated uncertainty; in the case of a plane, a plane perpendicular to the first observation is used,
– rotation update: this function rotates the object on itself,
– translation update: this function translates the object.
Also, for the constraint between a frame and a plane, three other functions are required:
– rotation error r̂^c_{i,j}: this function computes the rotation error between the frame and the plane,
– translation error t̂^c_{i,j}: this function computes the translation error between the frame and the plane,
– constraint observation model h(O_i, O_j): the function that, given the state of a frame and a plane, computes the segment that should be observed in the frame.
The work [3] includes more details on how to define the functions for the node and constraint models; it also includes the node used for robot frames and the constraint model used for odometry. The rest of this section details the object model for planes, as well as the constraint model used with the plane detection algorithm of Section 2.
Figure 4 shows the convention used for the spherical coordinates (ρ, θ, φ), where ρ is the distance of the plane to the origin, θ is the rotation of the normal around the axis Oy, and φ around Oz. We assume the scanner is in the plane Oxy; if that is not the case, one can simply add a frame for the sensor, expressed in the robot frame, with the transformation between the robot and that frame in the graph.

Fig. 4. The left figure shows the feature model for a plane, represented using spherical coordinates (ρ, θ, φ) expressed in the frame F_p. The right figure represents the observation and feature model for a plane: the observation in frame F_j is the dark red segment P_{i,1}P_{i,2}, while P^⊥_{i,1} and P^⊥_{i,2} are respectively the projections of P_{i,1} and P_{i,2} on the plane Π_j. F_a is the common ancestor in the tree of F_j and F_p.

3.1. Feature model
The plane is represented using spherical coordinates, with the parameters ρ, θ and φ (Figure 4), where (ρ, θ, φ) are the coordinates of a point of the plane and θ and φ are the angles of the normal N to the plane. A polygon representing the boundary of the plane is associated with the model; this boundary has little use for the purpose of estimating the parameters of the plane, but can be used for display purposes or in the matching process [2].
Rotation Update The rotation update is simply given by applying the rotation to the normal vector:

N_n = R_up · N_{n−1}   (5)
ρ remains constant. N_n is the normal of the plane and R_up is the rotation update.
Translation Update The translation of an infinite plane only applies a change in the direction parallel to the normal; with a spherical representation, only the distance ρ_n to the origin is affected:

ρ_n = ρ_{n−1} + ⟨N_n, t_{n−1}⟩   (6)
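As a small illustration, here is a sketch of the two update rules (5) and (6), assuming the plane is stored as a unit normal N and a distance ρ; the function names are hypothetical.

```python
import numpy as np

def rotation_update(normal, R_up):
    """Eq. (5): rotate the plane normal; the distance rho is unchanged."""
    return R_up @ normal

def translation_update(rho, normal, t):
    """Eq. (6): only the distance to the origin changes, by <N, t>."""
    return rho + float(np.dot(normal, t))
```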
3.2. Constraint model
Given a frame F_i and a plane Π_j, they are connected in the graph by a constraint, which is the line segment that was detected and tracked in Section 2. The constraint between F_i and Π_j is represented by the extremities of the observed segment, P_{i,1} and P_{i,2}; P^⊥_{i,1} and P^⊥_{i,2} are respectively the projections of P_{i,1} and P_{i,2} on the plane Π_j (see Figure 4).
Rotation error The rotation error is the minimal rotation that ensures the observation becomes parallel to the plane. This is the rotation between the vectors P_{i,1}P_{i,2} and P^⊥_{i,1}P^⊥_{i,2}:

u₁ = P_{i,1}P_{i,2} / ‖P_{i,1}P_{i,2}‖   (7)
u₂ = P^⊥_{i,1}P^⊥_{i,2} / ‖P^⊥_{i,1}P^⊥_{i,2}‖   (8)
r̂^c_{i,j} = R_i · (cos⁻¹(⟨u₂, u₁⟩), u₂ ∧ u₁) · R_i⁻¹   (9)
where R_i is the rotation of the frame F_i.
Translation update The translation update minimises the distance between the points of the scan and the plane. It is therefore defined as:

t̂^c_{i,j} = (P_{i,1}P^⊥_{i,1} + P_{i,2}P^⊥_{i,2}) / 2   (10)

Recover plane parameters However, the steepest gradient descent algorithm as described in [3] does not allow the normal of the plane to be recovered correctly, since that would require defining a rotation of the plane, which would raise the question of the choice of an axis and of where to apply this rotation. Instead, a least squares optimisation is applied; combined with the "rotation update" and the "translation update", this ensures that the observation points become coplanar.
Observation model The observation model can be used for predicting the parameters of the line segment that needs to be tracked in Section 2, and it is also used by WCNO to update the uncertainty of the plane Π_j and of the frame F_i. It is defined as the intersection of the plane of the 2D LIDAR sensor, expressed in the frame F_i, with the plane Π_j. The following formula assumes that the 2D LIDAR sensor is in the plane Oxy of F_i:

h_{i,j} = (ρ^obs_{i,j}, φ^obs_{i,j}) = (ρ^i_j / sin(θ^i_j), φ^i_j)   (11)

where ρ^i_j, θ^i_j and φ^i_j are the coordinates of the plane Π_j expressed in the frame F_i.
Boundaries of the plane The extremities of the observation are projected onto the plane, and then it is possible to recover a boundary [4]. Since the boundary is only used for display, and for simplicity, in our experimentation we have used the convex hull of the projections of the extremities.
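A sketch of the observation model (11) is given below; the variable names mirror the notation above and the function signature is an assumption. It assumes the plane is not parallel to the sensor plane (θ ≠ 0), consistently with the requirement stated in the introduction.

```python
import math

def predict_observation(rho_ij, theta_ij, phi_ij):
    """Eq. (11): intersection of the plane (rho, theta, phi), expressed in
    frame F_i, with the sensor plane Oxy of F_i. Returns (rho_obs, phi_obs),
    i.e. the line parameters expected in the 2D scan."""
    return rho_ij / math.sin(theta_ij), phi_ij
```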
4. Experiments
We have used the algorithm in both simulation and real data experiments. The simulation allows us to check that the algorithm is able to make an accurate estimation of the parameters, while the real data allow us to validate the usefulness of the algorithm on a real system.
Fig. 5. This figure shows the algorithm at work on a simulated world. The transparent grey buildings are the ground truth, while the cyan polygons represent the estimated planes. On the bottom left image, one can see that the detection of planes breaks on faces with few hit points, and on the bottom right image, the roof of the building gets a better estimate than the walls.
4.1. Simulation
Flying over a set of buildings In this experiment, an aircraft flies over a set of buildings while the LIDAR is pointing down. This is a very classical set-up for generating a model of the environment from an aircraft. Figure 5 shows screenshots of the results. For a total of 54 detected planes, the average error on ρ is 0.14 m, with a maximum of 0.77 m, while the average error on the angle of the normal is 3.88° with a maximum of 17.73°.
Effect of correction from planes Since we are using a 2D LIDAR sensor, the observation of planes in the environment does not allow the full 3D transformation between two frames to be recovered. In fact, as shown in Figure 6, the observation of a plane provides a translation that is perpendicular to that plane, as well as a rotation whose axis is parallel to that plane. So, for a plane that is perpendicular to the axis Ox, if the 2D scan is in the plane Oxy, then the observation of that plane provides a translation correction in the direction of Ox and a rotation correction around the axis Oz.
In order to illustrate the correcting effect brought by the plane, we have made a simulation experiment where the robot observes a horizontal plane and the scan laser has an angle of 45° with both the robot frame and the plane. For the first three positions, the odometry has a high accuracy, which allows a good estimate of the parameters of the plane to be obtained; for the fourth step, however, one of the values coming from the odometry is defined with a high uncertainty, with either an error of 50 m in z or 22° in roll. The results are shown in Figures 7 and 8. While the experiment in itself is unrealistic, it does show the benefit of using an iterative approach to estimate the parameters of the planes in the environment. This is especially interesting for an aircraft: since GPS units usually have a poor estimate of the altitude, the detection of planes would allow that information to be corrected.
Fig. 6. This figure shows the correction effect from observing a plane with a 2D LIDAR sensor. The red lines show the observed constraints, while the purple lines show the ideal position of the constraints. The contributions of the planes Π₁ and Π₂ to the correction are the translations T₁ and T₂ and the rotations (Oz, θ₁) and (Oz, θ₂).
4.2. Ground robot
We have also run the algorithm on real data, namely the New College data set [15], which was created using a robot with a vertical laser on the side of the robot. As can be seen in Figure 9, the algorithm has no problem with the detection of the ground. However, the detection of walls is often more fragmented; this is due to the presence of vegetation in front of the buildings, and also because the building facades are not completely flat.
For comparison purposes, using the same sequence, a point cloud was generated and a RANSAC technique was applied, using the Point Cloud Library [14], to cluster the points according to their coplanarity. The results are shown in Figure 10. The first point to note is that the result of the classification is unfiltered, meaning that points that do not really belong to an actual plane are still shown. It is also worth noting that the resulting segmentation is rather unstructured, meaning that points classified as belonging to the same plane might actually be very far apart; this effect is particularly visible in the bottom of Figure 10, where several planes are juxtaposed.
Fig. 7. This figure shows that the optimisation process is capable of recovering from a large error on the estimation of the roll and z parameters: (a) before optimisation (error in roll); (b) after optimisation; (c) before optimisation (error in z); (d) after optimisation. The black lines represent the constraints between nodes; in particular, the black triangles are the observations coming from the LIDAR. The red polygon is the plane, the frames represent the robot and sensor positions, while the cyan polygon is the estimated plane. On the first three robot poses, the odometry estimate is accurate, while on the fourth pose the error has been exaggerated (in roll for the top row of images and in z for the bottom row).
5. Conclusions and Future Works
We have presented an algorithm that extracts observations of planes from the 2D scans of a LIDAR sensor; the algorithm works by fitting line segments to the points of the scan and then tracking those segments frame after frame. Those observations are then used in a graph-based framework to estimate the parameters of the planes and improve the localisation of the robot. We have shown through simulation results that the method is capable of accurately estimating the parameters of the planes, and we have shown that the method works with real data.
Fig. 8. This figure shows the evolution of the correction from the first iteration to iteration 50. The number of iterations needed to correct the angle is much smaller than the number needed to correct the error in translation.
Fig. 9. Mapping using the New College dataset [15]: the top image shows a view of the environment, the middle image shows the map halfway through the loop, and the bottom image shows the resulting map.
A needed improvement to the algorithm itself would be a better method to estimate the boundary of a plane; this would give a more representative limit of the plane, for visualisation or for use by geometric-based loop closure algorithms [2]. While the method we have presented allows planes to be detected with a compact representation, which is useful in urban-like environments, it does not allow a full representation of the environment, especially since not every object in the environment is a plane. In future work, it would be interesting to investigate mixing in other types of objects to get an even richer environment model, such as cylinders and ellipses, or simply adding local occupancy grids using the points that are not transformed into planes.
AUTHOR
Cyrille Berger – Department of Computer and Information Science, University of Linköping, SE-581 83 LINKÖPING, Sweden, e-mail: cyrille.berger@liu.se
Fig. 10. This figure shows the segmentation of a point cloud accumulated on the New College data [15], using a RANSAC segmentation of the point cloud with the PCL library [14]. The top two figures use a precision of 1 cm while the bottom two use a precision of 10 cm. Points that belong to the same plane are coloured with the same colour.
References [1] T. Bailey, H. Durrant-Whyte, “Simultaneous Localisation and Mapping (SLAM): Part II – State of the Art”, Robotics and Automation Magazine, September 2006. [2] C. Berger, “Perception of the environment geometry for autonomous navigation”, PhD thesis, University of Toulouse, 2009. [3] C. Berger, “Weak constraints network optimiser”. In: IEEE International Conference on Robotics and Automation, 2012. [4] M. Duckham, L. Kulik, M. Worboys, A. Galton, “Efficient generation of simple polygons for characterizing the shape of a set of points in the plane”, Pattern Recognition, vol. 41, no. 10, 2008, pp. 3224–3236. [5] H. Durrant-Whyte, T. Bailey, “Simultaneous Localisation and Mapping (SLAM): Part I – The Essential Algorithms”, Robotics and Automation Magazine, June 2006. [6] M. Hebel, U. Stilla, “Pre-classification of points and segmentation of urban objects by scan line analysis of airborne lidar data”. In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 37, 2008, pp. 105–110. [7] J.H. Joung, K.H. An, J.W. Kang, M.J. Chung, W. Yu, “3d environment reconstruction using modified color icp algorithm by fusion of a camera and a 3d laser range finder”. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009, pp. 3082– 3088. [8] M. Kirscht, C. Rinke, “3d reconstruction of buildings and vegetation from synthetic aperture radar (sar) images”. In: IAPR Workshop on Machine Vision Applications, 1998, pp. 17–19. [9] M. Morgan, A. Habib, “Interpolation of lidar data and automatic building extraction”. In: American Society of Photogrammetry and Remote Sensing Conference, 2002. [10] V. Nguyen, S. Gächter, A. Martinelli, N. Tomatis, R. Siegwart, “A comparison of line extraction algorithms using 2d range data for indoor mobile
robotics”, Autonomous Robots, vol. 23, no. 2, 2007, pp. 97–111. [11] S.T. Pfister, S.I. Roumeliotis, J.W. Burdick, “Weighted line fitting algorithms for mobile robot map building and efficient data representation”. In: IEEE International Conference on Robotics and Automation, 2003, pp. 14–19. [12] T. Rodriguez, P. Sturm, P. Gargallo, N. Guilbert, A. Heyden, J.M. Menendez, and J.I. Ronda, “Photorealistic 3d reconstruction from handheld cameras”, Machine Vision and Applications, vol. 16, no. 4, 2005, pp. 246–257. [13] F. Rottensteiner, J. Jansa, “Automatic extraction of buildings from lidar data and aerial images”. In: International Society for Photogrammetry and Remote Sensing Symposium, 2002. [14] R.B. Rusu, S. Cousins, “3D is here: Point Cloud Library (PCL)”. In: IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9–13, 2011. [15] M. Smith, I. Baldwin, W. Churchill, R. Paul, P. Newman, “The new college vision and laser data set”, The International Journal of Robotics Research, vol. 28, no. 5, May 2009, pp. 595–599. [16] G. Vosselman, S. Dijkman, “3d building model reconstruction from point clouds and ground plans”. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 34, pp. 37–44. [17] K.M. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, W. Burgard, “Octomap: A probabilistic, flexible, and compact 3d map representation for robotic systems”. In: ICRA Workshop, 2010. [18] S. You, J. Hu, U. Neumann, P. Fox, “Urban site modeling from lidar”. In: International Conference on Computational Science and its Applications, ICCSA’03, Springer-Verlag, Berlin, Heidelberg 2003, pp. 579– 588. [19] Z. Zhang, “Iterative point matching for registration of free-form curves and surfaces”, International Journal of Computer Vision, vol. 13, 1994, pp. 119–152.
Understanding 3D Shapes Being in Motion
Received 10th October 2012; accepted 22nd November 2012.
Janusz Bedkowski

Abstract: This paper concerns the classification of 3D shapes being in motion. The goal is to develop a system with real-time capabilities to distinguish basic shapes (corners, planes, cones, spheres, etc.) that are moving in front of an RGB-D sensor. An improvement of state-of-the-art algorithms (normal vector computation using PCA – Principal Component Analysis – and SVD – Singular Value Decomposition, and the PFH – Point Feature Histogram) based on GPGPU (General Purpose Graphics Processing Unit) computation is introduced. This approach guarantees on-line computation of normal vectors; unfortunately, the computation time of the PFH for each normal vector is still a challenge for on-line operation, therefore this paper shows how to find a region of movement and to perform the classification process on the reduced amount of data. The proposed approach can be a starting point for further development of systems able to recognize objects in dynamic environments.

Keywords: RGB-D camera, point cloud, normal vector estimation, point feature histogram, parallel programming
1. Introduction
Understanding shapes being in motion is still an open problem in mobile robotics, especially if we would like to perform this task on-line. Understanding changes in a dynamic environment remains difficult not only because of the limitations of the available sensors but also because of the computational complexity of the algorithms performing this task. Sensors are now available (3D lasers such as the VELODYNE, or RGB-D cameras like the KINECT) that can provide accurate 3D data for indoor and outdoor environments. These sensors provide data on-line, therefore it should be possible to detect changes in the environment over short intervals of time. Unfortunately, existing approaches cannot perform the necessary computations on such data on-line, therefore it is proposed to improve the state of the art by using an NVIDIA GPGPU with CUDA (Compute Unified Device Architecture). Using a GPGPU is a promising choice because it can run thousands of kernels (functions performed on the GPGPU) over a decomposed 3D data set. The problem of understanding 3D shapes being in motion can be decomposed, therefore we can perform the computation for each single 3D point in parallel. Unfortunately, this is possible only for normal vector computation and motion detection; the Point Feature Histogram (PFH) remains a demanding problem because of its computational complexity.
2. Related Work
Currently we can observe numerous research efforts related to modern RGB-D KINECT-like sensors. In [1], Kurt3D
data from a KINECT sensor is shown, which gives the impression that the state of the art (SoA) already offers efficient mobile platforms equipped with advanced sensors for observing dynamic environments. Semantic object identification [2] is a well-known research direction, although currently related to static environments [3] and usually based on 3D laser data [4]. In [5] a model of an indoor scene is implemented as a semantic net; this approach is also used in [6]. In [7] the location of features is extracted by using a probabilistic technique (RANSAC: RANdom SAmple Consensus) [8]. It was shown in [9] that the region growing approach, extended from [10] using k-nearest neighbour (KNN) search, is able to process unorganized point clouds. The research presented in this paper is inspired by the work [11], where the authors were able to recognize several shapes such as corners, cylinders, edges, planes, tori, etc. They proposed the Point Feature Histogram (PFH) technique to distinguish different classes. Unfortunately, the performance of the proposed Point Feature Histogram algorithm was not satisfactory, therefore the authors proposed a faster approach – the Fast Point Feature Histogram (FPFH) [12]. Based on the state of the art, it is claimed that the problem of understanding shapes being in motion has not yet been discussed, therefore it was decided to develop a system that can be a starting point for solving the mentioned problem.
3. Normal Vector Estimation
Estimating the surface normal is performed by PCA – Principal Component Analysis (Figure 1) – of a covariance matrix C created from the nearest neighbours of the query point. The main contribution was to develop a PCA solver, based on the Singular Value Decomposition (SVD) method, that can be executed in parallel for all query points in a single step. In the last step of the algorithm, it is checked whether the normal vectors are consistently oriented towards the viewpoint, and they are flipped otherwise.
Fig. 1. Estimating the surface normal is performed by PCA – Principal Component Analysis of a covariance matrix C created from the nearest neighbors of the query point.
3.1. Parallel Implementation
Figure 2 shows the parallel implementation of normal vector estimation in the NVIDIA CUDA framework. The idea is to perform normal vector estimation in two steps, performing the computation for each query point in parallel. The first step is to compute the covariance matrices and store the result in the GPGPU's shared memory. In the second step, normal vector estimation is performed using the SVD (Singular Value Decomposition) method for each query point in parallel. The main contribution was to implement the CUDA kernels for covariance matrix computation and the SVD solver. It is important to emphasize the NNS (Nearest Neighbourhood Search) method in this approach: the fixed organization of the RGB-D data (640×480 or 320×240 data points) is used, therefore NNS is performed over the neighbours whose indexes are adjacent to the index of the query point. An additional threshold determines the radius of the search space.
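Below is a serial sketch of the per-query-point computation that the CUDA kernels parallelize. The window-based organized neighbour search, the parameter values and the function name are assumptions for illustration.

```python
import numpy as np

def estimate_normal(points, row, col, viewpoint, half_window=2, radius=0.03):
    """Estimate the surface normal at organized-grid pixel (row, col).

    points: H x W x 3 array of 3D points from the RGB-D frame.
    Neighbours are taken from a fixed index window (organized NNS) and kept
    only if closer than `radius` to the query point."""
    q = points[row, col]
    r0, r1 = max(0, row - half_window), min(points.shape[0], row + half_window + 1)
    c0, c1 = max(0, col - half_window), min(points.shape[1], col + half_window + 1)
    nbrs = points[r0:r1, c0:c1].reshape(-1, 3)
    nbrs = nbrs[np.linalg.norm(nbrs - q, axis=1) < radius]
    if len(nbrs) < 3:
        return None
    C = np.cov(nbrs.T)                 # 3x3 covariance of the neighbourhood
    w, v = np.linalg.eigh(C)           # eigen-decomposition of the symmetric matrix
    n = v[:, 0]                        # eigenvector of the smallest eigenvalue
    if np.dot(viewpoint - q, n) < 0:   # flip towards the viewpoint if needed
        n = -n
    return n
```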
4. Point Feature Histogram
A histogram of values encodes the local neighbourhood's geometrical properties by generalizing the mean curvature at a point p. This method provides an overall density- and pose-invariant multi-value feature [11]. Figure 3 shows the Point Feature Histogram for a query point. This method makes it possible to distinguish several types of shapes, such as planes, cylinders, corners, spheres, etc.
4.1. Parallel Implementation
To improve the performance of the Point Feature Histogram (PFH) computation, CUDA-based parallel computation is used. The current implementation computes 64 histograms in a single step; Figure 4 shows the idea. For the quantitative comparison with the state of the art (PCL – Point Cloud Library), the PFH is composed of three features, therefore its dimension is 125 (5×5×5). A sketch of the per-pair features and the binning is given below.
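The sketch computes a single 5×5×5 histogram for one neighbourhood using the three angular pair features commonly published for the PFH [11]; the exact feature definitions and binning used in the paper's implementation are not given, so the details here are assumptions (it also assumes at least two distinct points with valid normals).

```python
import numpy as np

def pair_features(p_s, n_s, p_t, n_t):
    """Three PFH angular features (alpha, phi, theta) for a point pair,
    following the commonly published Darboux-frame formulation."""
    d_vec = p_t - p_s
    d = np.linalg.norm(d_vec)
    u = n_s
    v = np.cross(d_vec / d, u)
    v /= np.linalg.norm(v)                 # assumes d_vec not parallel to n_s
    w = np.cross(u, v)
    alpha = np.dot(v, n_t)                               # in [-1, 1]
    phi = np.dot(u, d_vec / d)                           # in [-1, 1]
    theta = np.arctan2(np.dot(w, n_t), np.dot(u, n_t))   # in [-pi, pi]
    return alpha, phi, theta

def pfh_histogram(points, normals, bins=5):
    """125-bin (5x5x5) histogram over all point pairs in a neighbourhood."""
    hist = np.zeros((bins, bins, bins))
    edges = [np.linspace(-1, 1, bins + 1),
             np.linspace(-1, 1, bins + 1),
             np.linspace(-np.pi, np.pi, bins + 1)]
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            f = pair_features(points[i], normals[i], points[j], normals[j])
            idx = tuple(int(np.clip(np.searchsorted(e, x, side='right') - 1,
                                    0, bins - 1))
                        for e, x in zip(edges, f))
            hist[idx] += 1
    total = hist.sum()
    return (hist / total).ravel() if total else hist.ravel()
```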
Fig. 4. The idea of the parallel implementation of the computation of 64 PFHs in a single step. The maximum number of nearest neighbours of a query point is 1024. The computation has to be performed for each pair of normal vectors in the neighbourhood [11], therefore the kernels are organized into a 1024×1024×64 data structure.
Fig. 2. Parallel implementation of normal vector estimation in NVIDIA CUDA framework.
4.2. PFH classification
The method used for classification is K-Means clustering, where an iterative search for the optimal k clusters is performed. Histograms that belong to the same class tend to be grouped together in the same cluster. The cluster label can be verified by looking at its proximity to the mean histogram of the proper shape. This approach guarantees a finite number of computation steps to obtain candidate labels. The disadvantage is the low classification rate (60%–70%), but this is not considered a problem because, in general, we deal with an on-line classification process.
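A minimal sketch of this clustering-based labelling follows, using scikit-learn's K-Means as a stand-in for the paper's implementation; the data layout (one row per 125-bin histogram) and the helper names are assumptions, while the label assignment by proximity to the mean histogram of each shape is written out as described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_classifier(histograms, labels, k):
    """Cluster training histograms and give each cluster the label of the
    labelled class whose mean histogram is closest to the cluster centre."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(histograms)
    class_means = {c: np.mean([h for h, l in zip(histograms, labels) if l == c], axis=0)
                   for c in set(labels)}
    cluster_to_label = {ci: min(class_means,
                                key=lambda c: np.linalg.norm(centre - class_means[c]))
                        for ci, centre in enumerate(km.cluster_centers_)}
    return km, cluster_to_label

def classify(km, cluster_to_label, histogram):
    """Assign a candidate shape label to a new histogram."""
    cluster = int(km.predict(np.asarray(histogram).reshape(1, -1))[0])
    return cluster_to_label[cluster]
```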
5. Motion Detection
In the presented approach it is assumed that the RGB-D sensor is static during motion detection. Motion detection is performed by comparing the current RGB-D frame to the previous one: if there is a large difference in range for corresponding query points, or a large angle between corresponding normal vectors, the region of motion is obtained. This procedure is extremely fast and does not affect the classification procedure in terms of computational time. Figure 5 shows a ball in motion and the corresponding computed region of motion (red colour).
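A sketch of this per-pixel comparison for two consecutive organized frames; the threshold values are illustrative assumptions.

```python
import numpy as np

def motion_mask(depth_prev, depth_curr, normals_prev, normals_curr,
                range_thresh=0.05, angle_thresh=np.deg2rad(20)):
    """Per-pixel motion region for two consecutive organized RGB-D frames,
    assuming a static sensor: flag pixels whose range changed by more than
    `range_thresh` or whose normal rotated by more than `angle_thresh`."""
    range_change = np.abs(depth_curr - depth_prev) > range_thresh
    cos_angle = np.clip(np.sum(normals_prev * normals_curr, axis=-1), -1.0, 1.0)
    normal_change = np.arccos(cos_angle) > angle_thresh
    return range_change | normal_change
```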
6. Experiments
Fig. 3. Normal vector estimation visualization and Point Feature Histogram for query point.
6.1. Classifier Training and Testing
To train the system we have to provide a training data set composed of labelled histograms (supervised learning). The method used for classification was introduced in Section 4.2. The method uses an iterative search for the optimal k clusters, therefore histograms that belong to the same class tend to be grouped together and are marked with the same label. During the training of the system, the user
Fig. 5. Motion detection, red color corresponds to the region of motion.
marks similar histograms with the same label. Figure 6 demonstrates the idea of collecting training data with the help of the developed HMI (Human Machine Interface). The HMI is designed especially for supervised learning. Examples of histograms taken into the classifier training procedure are marked in pink in Figure 6. The bottom picture shows the classification result as different colours/labels for different histograms.
7. Quantitative Comparison with the State of the Art
The goal of this section is to compare quantitatively the performance of the open source PCL (Point Cloud Library) and the proposed parallel implementation, called cuPCL. The main problem for CUDA computation is the bottleneck related to copying data from/to the host to/from the device; the host corresponds to the CPU, the device to the GPGPU. The implementation is dedicated to the NVIDIA FERMI architecture, with the advantage of double floating point precision capability.
Figure 7 shows the comparison between PCL and cuPCL for normal vector estimation on a depth image of 640×480 vertexes. The performance is measured for different radii of the NNS (Nearest Neighbourhood Search) procedure; obviously, a larger radius requires more computation time. A reasonable radius is 3 cm, and for this value the speed-up of cuPCL over PCL is over 16 (GPGPU GF540M). Based on this observation, it is proposed to decrease the amount of processed data from 640×480 to 320×240 points. Figure 8 shows the comparison between PCL and cuPCL for normal vector estimation on a depth image of 320×240 vertexes. The speed-up is over 10 for a radius of 3 cm. It is important to emphasize the decreased variance of the computation time for cuPCL, which is an encouraging observation for building real-time systems.
The last experiment concerns the comparison of the PFH (Point Feature Histogram) computation. We can observe a satisfactory speed-up for cuPCL (over 40 for a radius of 3 cm), which can give the impression that it is possible to build an on-line system. Unfortunately, this figure shows the performance for computing 64 histograms in a single step; to compute more PFHs we need more GPGPUs, which can be considered a limitation for modern mobile platforms.
Fig. 6. Three training data sets and one testing data set. Histograms taken into the classifier training procedure are marked in pink. The bottom picture shows the classification result as different colours/labels for histograms of different classes (for example, brown corresponds to the wall).
8. Conclusion
In this paper a new implementation capable of classifying shapes being in motion is proposed. The system is able to detect regions of motion and to perform the classification of 64 Point Feature Histograms on-line (between 2 and 10 frames per second, depending on the GPGPU). The improvement is based on an NVIDIA CUDA implementation of algorithms that are available in the state-of-the-art PCL (Point Cloud Library). The main contributions of this paper are a new method for detecting motion and a new parallel implementation of an SVD (Singular Value Decomposition) solver for the PCA (Principal Component Analysis) related to normal vector estimation. The system can classify up to 64 shapes being in motion on-line with a 60%–70% classification rate. It can process more histograms, but this will determine the use of more GPGPUs to maintain the same performance.
AUTHOR
Janusz Bedkowski – Institute of Mathematical Machines, Institute of Automation and Robotics, Warsaw, Poland, e-mail: januszbedkowski@gmail.com.
Acknowledgements This work is performed under the funding of Polish National Center of Science, project ”Methodology of semantic models building based on mobile robot’s observations(2012-2015)” grant agreement UMO2011/03/D/ST6/03175.
Fig. 7. The performance of normal vector estimation on a depth image of 640×480 vertexes, computed via CPU (PCL) and GPU (cuPCL).
References [1] J. Elseberg, D. Borrmann, A. Nüchter, “Efficient processing of large 3d point clouds”. In: Proceedings of the XXIII International Symposium on Information, Communication and Automation Technologies (ICAT11), Sarajevo, Bosnia, 2011. [2] A. Nüchter, J. Hertzberg, “Towards semantic maps for mobile robots”, Robot. Auton. Syst., vol. 56, 2008, pp. 915–926. doi:10.1016/j.robot.2008.08.001. URL http://dl.acm.org/citation.cfm?id=1453261. 1453481 [3] M. Asada, Y. Shirai, “Building a world model for a mobile robot using dynamic semantic constraints”. In: Proc. 11th International Joint Conference on Artificial Intelligence, 1989, pp. 1629–1634. [4] A. Nüchter, O. Wulf, K. Lingemann, J. Hertzberg, B. Wagner, H. Surmann, “3d mapping with semantic knowledge”. In: RoboCup International Symposium, 2005, pp. 335–346.
Fig. 8. The performance of normal vector estimation on a depth image of 320×240 vertexes, computed via CPU (PCL) and GPU (cuPCL).
[5] O. Grau, “A scene analysis system for the generation of 3-d models”. In: NRC ’97: Proceedings of the International Conference on Recent Advances in 3-D Digital Imaging and Modeling, IEEE Computer Society, Washington, DC, USA, 1997, p. 221. [6] A. Nüchter, H. Surmann, K. Lingemann, J. Hertzberg, “Semantic scene analysis of scanned 3d indoor environments”. In: Proceedings of the 8th International Fall Workshop on Vision, Modeling, and Visualization (VMV 03), 2003, pp. 665–673. [7] H. Cantzler, R. B. Fisher, M. Devy, “Quality enhancement of reconstructed 3d models using coplanarity and constraints”. In: Proceedings of the 24th DAGM Symposium on Pattern Recognition, Springer-Verlag, London 2002, pp. 34–41. URL http://dl.acm.org/citation.cfm?id=648287. 756379
Fig. 9. The performance of the computation of 64 Point Feature Histograms via CPU (PCL) and GPU (cuPCL).
[8] M. A. Fischler, R. Bolles, “Random sample consensus. A paradigm for model fitting with applications to image analysis and automated cartography”. In: Proc. 1980 Image Understanding Workshop (College Park, Md., Apr 1980), L. S. Baumann (ed.), Science Applications, Arlington, Va., 1980, pp. 71–88.
[9] M. Eich, M. Dabrowska, F. Kirchner, “Semantic labeling: Classification of 3d entities based on spatial
feature descriptors”. In: IEEE International Conference on Robotics and Automation (ICRA2010), Anchorage, Alaska, May 3, 2010. [10] N. Vaskevicius, A. Birk, K. Pathak, J. Poppinga, “Fast detection of polygons in 3d point clouds from noise-prone range sensors”. In: IEEE International Workshop on Safety, Security and Rescue Robotics, SSRR, IEEE, Rome, 2007, pp. 1–6. [11] R. B. Rusu, Z. C. Marton, N. Blodow, M. Beetz, “Learning informative point classes for the acquisition of object model maps”, In: Proc. 10th International Conference on Control, Automation, Robotics and Vision ICARCV, 2008, pp. 643–650. [12] R. B. Rusu, N. Blodow, M. Beetz, “Fast point feature histograms (fpfh) for 3d registration”. In: Proc. IEEE International Conference on Robotics and Automation ICRA09, 2009, pp. 3212–3217.
Interactive Mapping Using Range Sensor Data under Localization Uncertainty
Received 10th October 2012; accepted 22nd November 2012.
Pedro Vieira, Rodrigo Ventura

Abstract: Several methods have been proposed in the literature to address the problem of automatic mapping by a robot using range scan data, under localization uncertainty. Most scan matching methods rely on the minimization of the matching error among individual range scans. However, uncertainty in sensor data often leads to erroneous matching, hard to cope with in a purely automatic approach. This paper proposes a semi-automatic approach, denoted interactive mapping, involving a human operator in the process of detecting and correcting erroneous matches. Instead of allowing the operator complete freedom in correcting the matching in a frame by frame basis, the proposed method facilitates the adjustment along the directions with more ambiguity, while constraining the others. Experimental results using LIDAR data are presented to validate empirically the approach, together with a preliminary user study to evaluate the benefits of the approach.

Keywords: 2D mapping, interactive interface, ICP, geometric constraint analysis, iterative methods
1. Introduction
The problem of automatic environment mapping by a robot has been an active field of research for a long time (see [1] for a review). Methods vary both in terms of the sensors used (e.g., sonars [2], LIDAR [3], vision [4], and more recently the Kinect [5], [9]) and in methodology (e.g., probabilistic [6], scan matching [3]). However, most of these methods are prone to local minima, originated, for instance, by ambiguity or by locally periodic patterns in the environment. Moreover, the loop closure problem remains not completely solved [7]. Approaches aiming at global consistency [3] have been proposed; however, they are often computationally complex, and still prone to error when faced with missing data.
In this paper we propose an alternative approach where we consider the human in the loop of the scan matching process. In particular, the user is asked to interact with the matching process by adjusting the match of individual pairs of scans. This adjustment is, however, constrained by favouring adjustments along the directions of greater ambiguity. Take for instance a case of pairs of identical scans taken from a homogeneous corridor (Fig. 1): the system favours the adjustment of the scans along the corridor, while disfavouring movements orthogonal to the corridor.
The proposed method is based on a graphical user interface (GUI), where the user interacts with the system using a common computer interface (mouse and keyboard). Consider a pair of range scans denoted M (for model) and D (for data). Fig. 1 illustrates the situation for the corridor example. When the user drags one of the range scans, say
Fig. 1. Two corridors being aligned. The scan D is adjusted such that the reaction force F_r, computed from the cost function gradient, balances the force F_m imposed by the mouse drag.
scan D, using the mouse, a corresponding virtual force F_m is produced (proportional to the drag vector). Opposing it, a reaction force F_r is computed based on the match between scans M and D. Considering the scan matching cost function as a potential function, the symmetric of its gradient with respect to the shift of scan D corresponds to the reaction force F_r. This reaction force "attracts" scan D towards M, along the direction of less ambiguity. Scan D is then moved such that both forces become balanced, F_m + F_r = 0. In the corridor example in Fig. 1, the reaction force is (mostly) orthogonal to the corridor axis.
Although the method has been implemented for 2D mapping, it can be easily extended to 3D. On the one hand, the matching cost function is applicable to 3D, and on the other, 2D mouse movements can be mapped into 3D forces, depending, for instance, on the map viewpoint in the GUI. The work presented here is a preliminary study towards a full 3D implementation.
Such a method can be used to make accurate maps, which can then be used, for instance, in search and rescue missions. These missions can vary from searching for victims within risky areas to analysing certain areas for better mission planning. The availability of a map of these areas can dramatically improve the efficiency and effectiveness of these operations.
Little related work can be found in the literature concerning interactive mapping. There is, however, some work on the interactive integration of semantic information in maps. Systems which have explored this topic include SemanticSLAM [10] and Interactive SLAM [11], both prototypes not developed beyond their early stages.
This paper is organized as follows. In Section 2 the method for interactive mapping is proposed, followed by Section 3 showing the results, together with a small user study. Finally, Section 4 draws some conclusions and discusses possible future work directions.
2. Interactive Alignment
Consider two scans in the 2D space, M = {m_k} and D = {d_k}, for m_k, d_k ∈ R², and an initial rigid transformation (R, t) which aligns D with M, where R is a rotation matrix parametrized by an angle θ:

R(θ) = [cos(θ), −sin(θ); sin(θ), cos(θ)]   (1)

and t is a translation vector. Although this transformation can be initialized with a default value (e.g., R|_{θ=0} = I_{2×2} and t = (0, 0)ᵀ), we start by using the Iterative Closest Points (ICP) algorithm (see [8]) to get the best initial transformation automatically. By applying this transformation to D, a transformed scan, D′, is obtained. The two scans (M and D′) are then presented in a viewer (GUI) for alignment analysis. If the alignment is not satisfactory, the user interacts with the viewer using a common computer mouse and keyboard, and applies either translations or rotations to D′ in order to correct it. These actions are carried out separately, using a designated key to choose which mode to use.
Scans are aligned by balancing a virtual force originated by a mouse drag, henceforth called the mouse force. This drag is defined by two points: p_o, the mouse pointer position where the user clicked, and p_f, the current pointer position. We consider the mouse force to be proportional to the difference between p_f and the initial point p_o transformed by the alignment under way. In the general case, we assume D′ rotated by θ with respect to a center of mass c, and then translated by t. Then, point p_o is transformed into p_o′ = R(θ)(p_o − c) + c + t. Now, we define a potential function J_m that grows with the distance between points p_o′ and p_f:

J_m = ½ ‖p_f − p_o′‖².   (2)
By taking the gradient of this potential with respect to either the translation t or the rotation R, one obtains the virtual forces and torques induced by the mouse drag. The potential function that minimizes the distances between corresponding points of the two point clouds, and that is responsible for creating the reaction force to the mouse drag, is given by:

J_r = ½ Σ_{k=1}^{N} ‖m_k − [R(θ)(d_k − c) + c + t]‖²,   (3)
where {m_k} and {d_k} are pairs of closest points from M and D′ respectively, and N is the number of these pairs. This function is similar to the cost function used in ICP. The closest points are computed in the same fashion as in ICP (e.g., using Kd-trees), and only the pairs that are sufficiently close are considered.
2.1. Translations
In this mode, the alignment consists of a translation by t. Thus, R(θ) is the identity (θ = 0), and the potential function (2) is simplified. The mouse force F_m is computed
from the gradient of the cost function (2) with respect to the translation t and evaluated at θ = 0:
Fm
=
−km ∇t Jm |θ=0
=
km (pf − po − t) ,
(4)
where km is the proportionality constant. Opposing this force, a reaction force Fr is computed from the gradient of the cost function (3) with respect to the translation t, and evaluated at θ = 0:

Fr = −kr ∇t Jr |θ=0 = kr Σ_{k=1}^{N} [mk − (dk + t)],  (5)
where kr is the proportionality constant. Fig. 2 illustrates the forces involved.

Fig. 2. Geometry of the force balance for translations before alignment (t = 0).

To find a scan adjustment that balances the mouse and the reaction forces, translations are performed iteratively, since each time scan D′ moves, the correspondences among points may change. So, for each iteration, a translation t is computed by solving the equation

Fm + Fr = 0  (6)

with respect to t. This equation has the following closed form solution:

t = [ km (pf − po) + kr Σ_{k=1}^{N} (mk − dk) ] / ( km + N kr ).  (7)

The scan adjustment results from Algorithm 1.

Algorithm 1 Compute Translation.
function Translation(po, pf, M, D′)
    Dnew ← D′
    repeat
        P ← ComputeClosestPoints(M, Dnew)
        t ← ComputeTranslation(po, pf, P)
        Dnew ← UpdateScan(D′, t)
    until ErrorChange(t) < ξt
    return t
end function

The algorithm receives as input the initial and current mouse positions and the two scans being adjusted (M and D′). It starts by computing a subset, P, of pairs of closest points (ComputeClosestPoints) from the two scans (P ⊂ [mk, dk]). The subset P is then used together with the mouse positions in (7) to compute the
translation (ComputeTranslation). Finally, scan D′ is updated (UpdateScan) with the new estimate of the translation, and convergence is checked (ErrorChange). The algorithm iterates until the correspondences between points remain the same or, in other words, until the norm of the difference between the new and the previous translation estimates falls below a threshold:

‖ti − ti−1‖ < ξt,  (8)

where ti−1 and ti are the translation estimates at iterations i − 1 and i, respectively, and ξt is a threshold. The obtained t = [tx ty]T corresponds to the homogeneous transformation

Tt = [ 1 0 tx ; 0 1 ty ; 0 0 1 ].  (9)
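As an illustration of how this force balance can be coded, the Python sketch below mirrors Algorithm 1 and the closed-form solution (7). It is not the authors' implementation: the closest-point search (here a SciPy Kd-tree), the iteration cap and the stopping threshold ξt are assumptions made for the example, while km, kr and the 0.2 m closest-point threshold follow the values reported in Section 3.

import numpy as np
from scipy.spatial import cKDTree

def closest_pairs(M, D, max_dist=0.2):
    # Kd-tree correspondence search, as in ICP; only sufficiently close pairs are kept.
    dist, idx = cKDTree(M).query(D)
    keep = dist < max_dist
    return M[idx[keep]], D[keep]

def interactive_translation(p_o, p_f, M, D, k_m=0.1, k_r=0.001, xi_t=1e-4, max_iter=100):
    # Iteratively solve Fm + Fr = 0 for the translation t (Eq. 7, Algorithm 1).
    t = np.zeros(2)
    for _ in range(max_iter):
        Mk, Dk = closest_pairs(M, D + t)      # correspondences of the currently shifted scan
        Dk = Dk - t                           # back to the untranslated points d_k of Eq. (7)
        N = len(Mk)
        t_new = (k_m * (p_f - p_o) + k_r * (Mk - Dk).sum(axis=0)) / (k_m + N * k_r)
        if np.linalg.norm(t_new - t) < xi_t:  # convergence test of Eq. (8)
            return t_new
        t = t_new
    return t

The returned t would then be embedded in the homogeneous transformation Tt of (9).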
2.2. Rotations

Rotation mode comprises a rotation of scan D′ with respect to the center of mass c, by an angle θ (Fig. 3). The center of mass of a scan is set to its centroid.

Fig. 3. Torque produced by the force originated from a mouse drag prior to alignment (θ = 0).

When the user clicks on scan D′ and attempts to drag it, a force Fm is created using (2), for t = 0. However, unlike translations, the balance is formulated here in terms of virtual torques. The mouse torque τm is obtained by computing the gradient of the cost function (2) with respect to θ, evaluated at t = 0:

τm = −km ∇θ Jm |t=0.  (10)

The opposing torque τr is obtained by computing the gradient of the cost function (3) with respect to θ, evaluated at t = 0:

τr = −kr ∇θ Jr |t=0.  (11)
Both gradients can be computed using the chain rule:

∇θ Jm |t=0 = tr[ (dm)T dR ],  (12)
∇θ Jr |t=0 = tr[ (dr)T dR ],  (13)

where

dm = ∂Jm / ∂R(θ),  (14)
dr = ∂Jr / ∂R(θ),  (15)
dR = ∂R(θ) / ∂θ.  (16)

The partial derivative (16) can be computed from the derivative of (1) with respect to θ:

dR = ∂R(θ)/∂θ = A R(θ),  (17)

where A = [ 0 −1 ; 1 0 ]. The partial derivatives (14) and (15) are trivially obtained by computing the derivative of (2) and (3), respectively, with respect to R:

dm = −pf′ riT,  (18)
dr = −Σ_{k=1}^{N} mk′ (dk′)T,  (19)

where ri = (po − c), pf′ = (pf − c), dk′ = (dk − c) and mk′ = (mk − c). Both torques are obtained by computing the trace of (12) and (13) and using the partial derivatives (17), (18) and (19):

τm = km riT RT AT pf′,  (20)
τr = kr Σ_{k=1}^{N} (dk′)T RT AT mk′.  (21)

Note that τm can also be interpreted as the norm of the cross product between the arm r = R(θ)(po − c) and the mouse force Fm.
As in the case of translations, the scan D′ is iteratively rotated until convergence of the correspondences is reached. In each iteration, the balance of torques

τm + τr = 0  (22)

has a closed form solution. Using (20) and (21) in (22), we obtain

tan(θ) = [ km riT AT pf′ + kr Σ_{k=1}^{N} (dk′)T AT mk′ ] / [ km riT pf′ + kr Σ_{k=1}^{N} (dk′)T mk′ ].  (23)

However, this only allows us to compute θ up to a congruence of π. To resolve this ambiguity, which is caused by the existence of two solutions π radians apart, one has to determine which one corresponds to a stable solution. This can be easily determined from the sign of the derivative of the total torque τ = τm + τr,

∂τ/∂θ = −km riT RT(θ) pf′ − kr Σ_{k=1}^{N} (dk′)T RT(θ) mk′.  (24)

A solution is stable if and only if the sign of this derivative is negative. A positive derivative implies that a small perturbation in θ would swing the scan π radians towards the other solution. Note that (24) has opposite signs for the angles θ and θ + π, and therefore there is always a single solution with a negative derivative.
The scan adjustment follows an algorithm similar to the one used for translations (Alg. 2). The algorithm starts by computing the scan centroid (ComputeScanCentroid) and uses it to center the mouse positions at the origin.
Algorithm 2 Compute Rotation Angle.
function Rotations(po, pf, M, D′)
    c ← ComputeScanCentroid(D′)
    Dnew ← D′
    ri ← po − c
    pf′ ← pf − c
    repeat
        P ← ComputeClosestPoints(M, Dnew)
        for each pair of closest points in P do
            mk′ ← mk − c
            dk′ ← dk − c
            P′ ← (mk′, dk′)
        end for
        θ ← ComputeRotationAngle(ri, pf′, P′)
        if sign(TorqueDer(θ, ri, pf′, P′)) > 0 then
            θ ← θ + π
        end if
        Dnew ← UpdateScan(D′, θ)
    until ErrorChange(θ) < ξθ
    return θ
end function
Fig. 4. Initial dataset of a map with 118 scans, obtained using a simple implementation of EKF together with scan matching using the LIDAR.
In each iteration, new correspondences are obtained (ComputeClosestPoints) and used to compute the rotation angle (ComputeRotationAngle) with (23). The stable solution is then chosen (TorqueDer) using (24). Finally, scan D′ is updated (UpdateScan) with the new estimate of the rotation angle, and convergence is checked (ErrorChange). The algorithm iterates until the absolute difference between the new and the previous rotation angle falls below a threshold:

|θi − θi−1| < ξθ,  (25)

where θi−1 and θi are the previous and the new estimates of the rotation angle and ξθ is a threshold. A homogeneous transformation reflecting the rotation performed is then created using the value θ:

Tθ = [ cos(θ)  −sin(θ)  bx ; sin(θ)  cos(θ)  by ; 0 0 1 ],  (26)

b = [bx by]T = [I − R(θ)] c.  (27)
Note that translation b accounts for the fact that the rotation is performed with respect to center c, rather than to the origin.
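For illustration, one iteration of this rotation balance, covering the closed-form angle (23) and the stability test (24), could be sketched in Python as below. Variable names and the use of numpy.einsum are assumptions made for the example; correspondences are assumed to be computed and centered exactly as in Algorithm 2, and this is not the authors' implementation.

import numpy as np

A = np.array([[0.0, -1.0], [1.0, 0.0]])          # derivative generator of Eq. (17)

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])            # rotation matrix of Eq. (1)

def rotation_step(r_i, p_f_c, Mc, Dc, k_m=0.1, k_r=0.001):
    # r_i = po - c, p_f_c = pf - c; Mc, Dc hold the centered closest-point pairs mk', dk'.
    num = k_m * r_i @ A.T @ p_f_c + k_r * np.einsum('ki,ij,kj->', Dc, A.T, Mc)
    den = k_m * r_i @ p_f_c + k_r * np.einsum('ki,ki->', Dc, Mc)
    theta = np.arctan2(num, den)                  # candidate root of Eq. (23)
    # Stability test of Eq. (24): keep the root with a negative torque derivative.
    dtau = -k_m * r_i @ rot(theta).T @ p_f_c - k_r * np.einsum('ki,ij,kj->', Dc, rot(theta).T, Mc)
    if dtau > 0:
        theta += np.pi
    return theta

As in Algorithm 2, this step would be repeated with refreshed correspondences until |θi − θi−1| < ξθ, and the result embedded in the transformation Tθ of (26)–(27).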
3. Results

A dataset was initially obtained with a Nomadic Scout robot with a Hokuyo LIDAR onboard. Each scan was associated with a robot location estimated by the robot. This estimation was performed using a simple implementation of EKF together with scan matching using the LIDAR. A partial vector map of the environment was used. As a consequence, the localization algorithm got lost upon entering an office outside of its map (upper part in Fig. 4).
In order to see how the ICP algorithm performs, we ran it over the dataset of Fig. 4, obtaining the result shown in Fig. 5.

Fig. 5. Map obtained aligning the 118 frames of the initial dataset with ICP.

The ICP was able to align most scans; however, some of them converged to a local minimum, resulting in a wrong alignment.
Figs. 9 and 10 show the alignment of two scans with and without the use of forces for the two interactive modes, translations and rotations respectively. In the left sub-figures of Figs. 9 and 10, we can see that, without the use of forces, the scan movement tends to follow the mouse cursor. On the contrary, on the right side, the use of forces favours the adjustment of the scan along the corridor and disfavours movement orthogonal to it.
To validate the proposed method, a user study was conducted with 8 users selected from among the MSc students of Instituto Superior Técnico (IST). They were asked to correct the alignment of 118 frames from a map in two trials: with and without the use of forces. Users had no prior knowledge of the map, and everyone performed the correction of the alignment on the same initial dataset (Fig. 4). The user performance was measured by (i) the time they took to finish the task, and by (ii) the value of the cost function, which measures how well the corrections were made. This cost function is the sum of the cost functions over all the scan pairs:

f = Σ_{i=1}^{n−1} Ji,  (28)

where n is the number of scans and Ji is the cost function between two scans Si and Si+1, computed from expression (3).
If the mouse force is much higher than the reaction force, users will not feel the system restraining the movement of the scan, and consequently they will have total control over the scans. On the contrary, if the reaction force is much higher, the system will restrain the movement of the scans so strongly that users will not be able to move the scans as they desire. So, in order to have a good trade-off between control over the scans and help from the system, the force proportionality constants (km and kr) were empirically adjusted so that the reaction force is slightly higher than the mouse force. In the following experiments, the values km = 0.1 and kr = 0.001 were used, together with a closest-point threshold of 0.2 meters.
After each trial, users obtain knowledge about the initial alignment of each frame, which would influence the second trial, so we mitigated this bias by alternating the order of the trials between users. Half of them were asked to perform the trial without forces first, while the other half performed the other one first. Fig. 6 shows an example of a map after the interactive alignment, for a randomly chosen user.
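For reference, the evaluation metric (28) can be computed directly on the aligned scans. The sketch below is only an assumption of how such an evaluation could be coded; it reuses the closest_pairs helper assumed in the translation sketch above and evaluates (3) with the alignment already applied (R = I, t = 0).

import numpy as np

def total_map_cost(aligned_scans, max_dist=0.2):
    # Eq. (28): sum of the pairwise scan-matching cost Ji (Eq. 3) over consecutive scans.
    f = 0.0
    for S_i, S_next in zip(aligned_scans[:-1], aligned_scans[1:]):
        Mk, Dk = closest_pairs(S_i, S_next, max_dist)   # assumed helper from the earlier sketch
        f += 0.5 * float(np.sum(np.linalg.norm(Mk - Dk, axis=1) ** 2))
    return f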
Fig. 6. One of the maps obtained by a user after using the GUI to manually align the scans (compare with Fig. 4).

The results obtained for each user are presented in Fig. 7.

Fig. 7. Results in terms of task completion time and cost function obtained for each user. Each line corresponds to a user, and the "O" and the "X" markers are the results of using or not using forces.

Tab. 1. Average and standard deviation values for time and cost function.

            Without Forces                         With Forces
        Time (min)   Cost function value      Time (min)   Cost function value
  µ       16.27            36.86                10.74            30.60
  σ        5.45             8.09                 2.77             3.33

As is visible in Fig. 7 and Table 1, users were, on average, able to correct the alignment faster and better using forces than not using them. Paired t-tests over the results have shown that both the average time and the average cost function are lower when forces are used (p < 0.005 for time and p < 0.03 for cost).
At the end of the trials, users were asked to answer a questionnaire, using the well-known Likert scale as a rating criterion (from 1 to 5). The Likert statements were:
1) The interface used for the interactive mapping was easy to use.
2) The use of forces makes the alignment faster.
3) The use of forces makes the alignment easy.

Fig. 8. Box plot of the responses to the questionnaire.

Fig. 8 shows the box plot of the responses given by the users to each statement of the questionnaire. Overall, users responded positively to the usage of the GUI, in line with the goal of providing a system to facilitate manual alignment of range scans.
Fig. 9. Two corridors being aligned by applying translations. Left three figures: alignment without forces. Right three figures: alignment with forces. The blue/darker scan (M) is the original segment, the green/brighter scan (D′) is the new one, the black arrow is the current mouse position, po′ is the initial point clicked, c is the green scan center of mass and the black "O" marks are reference points.

4. Conclusions

In this paper we have proposed an interactive method: a method using virtual forces with the purpose of helping
users to adjust the alignment of range scan data. The use of forces proved to be a valuable aid in the correction of the alignment, making it faster and easier to use. A small user evaluation has corroborated this method. As future work we propose extending the concepts proposed in this paper to 3D.
AUTHORS
Pedro Vieira* – Institute for Systems and Robotics, Instituto Superior Técnico, e-mail: pedro.s.vieira@ist.utl.pt.
Rodrigo Ventura – Institute for Systems and Robotics, Instituto Superior Técnico, e-mail: rodrigo.ventura@isr.ist.utl.pt.
*Corresponding author
Acknowledgements
This work was supported by FCT projects [PEst-OE/EEI/LA0009/2011] and [PTDC/EIA-CCO/113257/2009].
Fig. 10. Two corridors being aligned by applying rotations. Left two figures: alignment without forces. Right two figures: alignment with forces. The blue/darker scan (M) is the original segment, the green/brighter scan (D′) is the new one, the black arrow is the current mouse position, po′ is the initial point clicked, c is the green scan center of mass and the black "O" marks are reference points.
References
[1] S. Thrun, "Robotic mapping: A survey". In: Exploring Artificial Intelligence in the New Millennium, G. Lakemeyer, B. Nebel (eds.), Morgan Kaufmann, 2002, pp. 1–35.
[2] A. Elfes, "Sonar-based real-world mapping and navigation", IEEE J. Robot. Automat., vol. 3, no. 3, 1987, pp. 249–265.
[3] F. Lu, E. Milios, "Globally consistent range scan alignment for environment mapping", Autonomous Robots, vol. 4, no. 4, 1997, pp. 333–349.
[4] D. Murray, C. Jennings, "Stereo vision based mapping and navigation for mobile robots". In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA'97), vol. 2, April 1997, pp. 1694–1699.
[5] P. Henry, M. Krainin, E. Herbst, X. Ren, D. Fox, "RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments". In: The 12th International Symposium on Experimental Robotics (ISER), 2010.
[6] A. Elfes, "Occupancy grids: A probabilistic framework for robot perception and navigation", Ph.D. dissertation, Carnegie Mellon University, 1989.
[7] P. Newman, G. Sibley, M. Smith, M. Cummins, A. Harrison, C. Mei, I. Posner, R. Shade, D. Schroeter, D. Cole, I. Reid, "Navigating, recognising and describing urban spaces with vision and lasers", The International Journal of Robotics Research, vol. 28, no. 11–12, 2009, pp. 1406–1433.
[8] P. J. Besl, N. D. McKay, "A method for registration of 3-D shapes", IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, 1992, pp. 239–256.
[9] K. Lai, L. Bo, X. Ren, D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset". In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA'11), 2011, pp. 1817–1824.
[10] F. Dellaert, D. Bruemmer, "Semantic SLAM for collaborative cognitive workspaces". In: AAAI Fall Symposium Series 2004: Workshop on The Interaction of Cognitive Science and Robotics: From Interfaces to Intelligence, 2004.
[11] A. Diosi, G. Taylor, L. Kleeman, "Interactive SLAM using laser and advanced sonar". In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA'05), 2005, pp. 1103–1108.
Assisted Teleoperation of Quadcopters Using Obstacle Avoidance

Received 10th October 2012; accepted 22nd November 2012.
João Mendes, Rodrigo Ventura

Abstract: Teleoperation of unmanned aerial vehicles often demands extensive training. Yet, even well-trained pilots are prone to mistakes, frequently resulting in collisions of the vehicle with obstacles. This paper presents a method to assist the tele-operation of a quadrotor using an obstacle avoidance approach. In particular, a rough map of the nearby environment is constructed using sonar sensors. This map is constructed using FastSLAM to allow tracking of the vehicle position with respect to the map. The map is then used to override operator commands that may lead to a collision. An unknown and GPS-denied environment is considered. Experimental results using the USARSim simulator are presented.

Keywords: collision avoidance, 3D FastSLAM, quadcopter, occupancy grid mapping
1. Introduction

A quadcopter is an aircraft propelled by four rotors. This type of vehicle fits in the Vertical Take-Off and Landing (VTOL) category, as it can successfully perform vertical take-offs, landings and hovering despite being heavier than air. The advantages of VTOLs over other flying mechanisms are well known, as shown in [1]. However, flying a quadrotor is not a simple task: for a pilot to have a safe flight and control the vehicle as desired, significant experience is required. Even extensively trained pilots may face situations where it is difficult to guarantee a safe flight due to, for instance, loss of visual contact. A method to automatically avoid collisions is, therefore, a major necessity.
The presented solution is an assistive algorithm for tele-operation of a quadrotor that avoids collisions, based on a FastSLAM [2] approach using an occupancy grid map [3]. Since the purpose of this work is neither a detailed map of the environment nor a precise measurement of obstacle positions, the problem can be efficiently addressed by knowing the relative position of the quadrotor with respect to the obstacles. Once the vehicle's position and map are known, a decision making process based on danger assessment, performed by a classifier, is applied. This classifier overrides the user's inputs whenever they compromise the quadcopter's physical integrity in the near future. Overriding may range from a simple velocity reduction to, in extreme cases, an evasive maneuver. Unknown areas are, for the sake of safety, always considered as occupied. The main difference between our approach and a simple reactive algorithm is memory. Unlike a purely reactive methodology, if the map is kept in memory it is possible to avoid crashes in the sonars' blind spots. The full architecture is presented in Figure 1.

Fig. 1. Full architecture of the proposed approach.

3D Simultaneous Localization and Mapping (SLAM) in Unmanned Aerial Vehicles (UAVs) using lasers has been studied, typically including techniques such as loop-closure algorithms [4]. As for obstacle avoidance methodologies for UAVs, the literature mostly addresses path replanning topics [6]. This paper differs from the above in that we aim at a rough, low-complexity map, and thus a more time-efficient one. We consider as a simulation test bed a quadrotor equipped with an Inertial Measurement Unit (IMU), an altimeter, and six sonars: one above each of the propellers pointing sideways, one above and one below the main body of the quadcopter.
2. Methodology

2.1. FastSLAM

Correct attitude is assumed to be given at all times by an IMU, since accurate attitude estimations can be provided by a commercial solution; thus the 6D problem (position and attitude) is reduced to a 3D problem (position only). The objective of SLAM is to estimate the position and the map of a robot simultaneously. Let x1:t denote the path of the robot, m the map, z1:t all measurements and u1:t all control inputs, where 1:t represents the time steps from 1 to t. To solve the SLAM problem we use the FastSLAM approach proposed by Montemerlo et al. [5], performing a factorization of the posterior

p(x1:t, m | z1:t, u1:t)  (1)

that decomposes SLAM into a path estimation problem and a mapping problem (2), hereby solved by a combination of a Particle Filter with an occupancy grid mapping algorithm:

p(x1:t | z1:t, u1:t) p(m | z1:t, x1:t).  (2)

The occupancy grid mapping algorithm uses a straightforward application of the inverse sensor model described in [7], where the posterior probability of the map is approximated as the product of the probabilities of each cell (3).
This probability represents the belief on the occupancy of each individual cell. Y p(m|z1:t , x1:t ) = p(mi |z1:t , x1:t ) (3)
VOLUME 7,
N◦ 1
2013
uncertainty along the vertical axis. The resulting weight [n] wt of each particle is equal to the product of all involved likelihoods, assuming conditional independence given the robot position and map, as described in [2].
i
Each cell not being updated is subject to a slow decay towards the value of the prior p(mprior ), in order to introduce a forgetness factor to all long time non-observed cells. Localization estimation is provided by a bootstrap Particle Filter [8]. It is a Bayesian state estimation method which approximates the posterior bel(xt ) = p(xt |zt , ut ) by a set of weighted particles St
[n]
Localization estimation is provided by a bootstrap Particle Filter [8]. It is a Bayesian state estimation method which approximates the posterior bel(xt) = p(xt | zt, ut) by a set of weighted particles St,

St = {st[1], st[2], ..., st[N]} = {[xt[1] wt[1] m[1]], [xt[2] wt[2] m[2]], ..., [xt[N] wt[N] m[N]]},  (4)

where each particle st[n] contains a different hypothesis

x̃t[n] = [X[n] Y[n] Z[n]]T  (5)

of the state to estimate, xt = [Xr Yr Zr]T. Multiple particles, N in total, are used, and to each one is associated a weight wt representing the importance of that specific hypothesis. For each iteration of the particle filter, a predict and an update step are performed. The predict step models the effects of the control inputs ut on each particle of St−1 by sampling from the motion model distribution. The referenced motion model is identical to the dynamic model applied by USARSim. The particle weights are computed in the second step. In this phase, a measurement model is used to evaluate the weight of each particle based on sensor information. This weight is updated by the likelihood of the sensor measurements zt given the prediction x̄t and the map m,

p(zt | m, x̄t).  (6)

The weight of each particle results from the joint likelihood of all measurements, given the map and the path. These measurements are given by (1) the sonars and (2) the altimeter. The sonar measurements are modeled with a Gaussian distribution:

P(zti | m[n], x̄t[n]) = 1/(σdist √(2π)) · exp( −(zti − d̄t[n])² / (2σdist²) ),  (7)

where zti stands for the measurement of sonar i, m[n] for the map, and d̄t[n] is the Euclidean distance between position hypothesis x̄t[n] and the first occupied cell in the map of the n-th particle. Note that equation (7) is applied for each sonar. The altimeter measurements are modeled with another Gaussian:

P(ht | m[n], x̄t[n]) = 1/(σalt √(2π)) · exp( −(ht − Z̄t[n])² / (2σalt²) ),  (8)

where ht is the altimeter measurement. By using the altimeter readings it is possible to significantly reduce the uncertainty along the vertical axis. The resulting weight wt[n] of each particle is equal to the product of all involved likelihoods, assuming conditional independence given the robot position and map, as described in [2]:

wt[n] = P(ht | m[n], x̄t[n]) · Π_i P(zti | m[n], x̄t[n]).  (9)

In order to determine a single position estimation, we choose the particle with the highest weight from the set:

x̂t = x̄t[n*], with n* = arg max_{i ∈ {1,...,N}} wt[i].  (10)
(6)
The weight of each particle results from the joint likelihood of all measurements, given the map and the path. These measurements are given by (1) sonars and (2) the altimeter. The sonar measurements are modeled with a Gaussian distribution:
P (zti |m[n] , x ¯t ) =
(9)
In order to determine a single position estimation we choose the particle with the highest weight from the set [n∗ ]
St =
[n]
P (zti |m[n] , x ¯t )
i
x ˆt = x ¯t [1] [2] [N ] {st , st , ..., st } [1] [1] [2] [2] [N ] [N ] {[xt wt m[1] ], [xt wt m[2] ], ..., [xt wt m[N ] ]}
Y
2
2 /2σalt
(8)
where ht is the altimeter measurement. By using the altimeter readings it is possible to significantly reduce the
Fig. 2. Flow chart of the Decision Block.

To successfully avoid crashes, the position of obstacles has to be known, as well as the direction in which the robot is flying. The volume check corresponds to the extrusion of a square centered on the quadcopter's position along the velocity vector, as illustrated in Figure 3. The size of this square, b, encompasses the quadrotor volume, while a is a visibility bound. Defining this volume enables the algorithm to find the position of the closest occupied cell.

Fig. 3. Graphical definition of the volume check.
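A possible rendering of the volume check and of the TTC computation is sketched below. The map query occupied(point), the sampling resolution and the default bounds a and b are assumptions for illustration, not the authors' code.

import numpy as np

def time_to_collision(pos, vel, occupied, b=1.0, a=5.0, step=0.25):
    # March a square cross-section of side b along the velocity direction up to the
    # visibility bound a; stop at the first occupied cell and return distance / speed.
    # occupied(point) -> bool should return True for unknown cells, as in the Decision Block.
    speed = np.linalg.norm(vel)
    if speed < 1e-6:
        return np.inf
    u = vel / speed                                   # unit direction of motion
    e1 = np.cross(u, [0.0, 0.0, 1.0])                 # first axis of the square cross-section
    if np.linalg.norm(e1) < 1e-6:                     # moving (almost) vertically
        e1 = np.array([1.0, 0.0, 0.0])
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(u, e1)                              # second axis, orthogonal to u and e1
    offsets = np.linspace(-b / 2.0, b / 2.0, 5)
    for d in np.arange(step, a + step, step):         # extrusion along the velocity vector
        for o1 in offsets:
            for o2 in offsets:
                if occupied(pos + d * u + o1 * e1 + o2 * e2):
                    return d / speed                  # time to collision at the current speed
    return np.inf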
Upon having the velocity estimation and the distance to the closest cell, it is easy to compute the TTC and use it as a danger assessment. The classifier block acts as a multiplexer, choosing an input and forwarding it to the vehicle, given a certain TTC. The threat levels, together with the contingency actions considered, are presented in Figure 4 and explained below; an illustrative sketch of the resulting classifier is given after the example that closes this subsection.
– No Action: For the given TTC the algorithm considers that no threat exists and no action is performed on the inputs, meaning that the vehicle is fully controlled by the user.
– Slow: If the TTC falls into the given interval, the quadcopter is considered to be in medium danger and the user's inputs will be limited. In order for the vehicle to increase its TTC to a safe value, thus causing the decision block to leave the Slow threat level, the velocity applied to the vehicle, denoted by vcommand, is

vcommand = dobstacle / TTC_TH_SLOW,  (11)

where dobstacle represents the distance between the vehicle and the obstacle, and TTC_TH_SLOW the threshold between the Slow and No Action stages (Figure 4).
– Stop: At this level, the threat is considered to be high and the velocity is immediately set to zero, making the vehicle hover.
– Evasive Maneuver: At this level the threat is considered to be extremely high, and the solution to avoid a collision in this situation is to command a velocity in the opposite direction of the current movement. By doing so, the distance travelled while decelerating is much lower than in the Stop stage.

Fig. 4. Threat levels.

Consider as an example that the vehicle is ordered to move towards an unknown part of the map. Since no information regarding those cells is yet known, the algorithm will consider them as occupied for safety reasons. Therefore, the above mentioned levels are equally applied. A problem occurs whenever the attitude of the vehicle is higher than half the field of view of the sonar, i.e., the sonar will not be able to check whether the volume in the desired direction is occupied or not, as it is not pointing there. By direct application of the computation and selection of the danger level, it is possible to conclude that a Slow order is given and the velocity will lower. The sonar will then point back towards the direction of the movement, due to the velocity reduction, and the distance to the object is updated. The algorithm will thus cause the vehicle to alternate between acceleration and deceleration in order to maintain knowledge of where it is heading. The emergence of this behavior can be seen in our experiments.
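The threat classification and override described above could be coded along the following lines. The numerical TTC thresholds are not reported in the paper and are purely illustrative, and applying the Slow speed of (11) along the user's commanded direction is likewise an assumption of this sketch.

import numpy as np

# Illustrative thresholds (seconds); the paper does not report the values used in Figure 4.
TTC_TH_SLOW, TTC_TH_STOP, TTC_TH_EVASIVE = 3.0, 1.5, 0.5

def classify_and_override(user_vel, ttc, d_obstacle, vel):
    # Multiplexer of the Decision Block: choose the command forwarded to the vehicle
    # according to the TTC-based threat level (No Action / Slow / Stop / Evasive).
    if ttc > TTC_TH_SLOW:                      # No Action: user keeps full control
        return user_vel
    if ttc > TTC_TH_STOP:                      # Slow: limit speed so the TTC recovers, Eq. (11)
        v_cmd = d_obstacle / TTC_TH_SLOW
        direction = user_vel / (np.linalg.norm(user_vel) + 1e-9)
        return v_cmd * direction
    if ttc > TTC_TH_EVASIVE:                   # Stop: hover
        return np.zeros(3)
    return -vel                                # Evasive: velocity opposite to the current motion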
3. Results

The full architecture was implemented and tested in the USARSim environment. A total of 10 particles was used, together with σdist = 2, σalt = 2, FoV = 30°, a maximum sonar range of 5 meters and square cells with a side length of 1 meter. The inputs of the examples are the user's commands, the attitude and the sonar readings. Neither the map nor the position is provided by the simulator. Note that FastSLAM and the decision block run online, during real-time simulation on USARSim.
In the first example the robot was ordered to move at full speed towards a wall out of sonar range. As observable from Figure 5, the vehicle is, up until around 7 seconds, consecutively lowering and increasing its distance to the obstacle. Since this distance keeps oscillating, no real obstacle is there; rather, as it is the first time the area is mapped, the algorithm is updating the cells to free while moving. After 7 seconds the distance starts to decrease steadily, meaning a real obstacle was found.

Fig. 5. Distance between the best particle and the first obstacle.

Figure 6 shows the estimated velocity of the vehicle over time. Note that the velocity is also oscillating. The user is ordering a full-speed movement (5 m/s), but the classifier applies a Slow override. When the obstacle is sensed to be further away from the robot, the inputs are once again given to the user. When the algorithm perceives that the vehicle is moving towards an obstacle, once again after 7 seconds, an override is imposed and the velocity is fully limited by the classifier and lowered to zero. By fusing the information from both figures it is possible to see that the vehicle stopped its movement at approximately 1.3 meters from the wall, despite being constantly ordered by the user to move in that direction. If, at this point, the robot were ordered to fly in the opposite direction, the vehicle would fly at full speed, since those cells are known to be free according to the map.

Fig. 6. Estimated velocity of the quadrotor corresponding to Fig. 5.

In the second example the trajectory performed has three distinct phases: the robot was initialized far from a wall and a velocity imposed towards it; a movement
along the wall; and a separation and re-approximation to the wall. With this experiment, the main benefit of our approach over a purely reactive method is shown. That distinction proves useful in the final part of the movement, where the vehicle is expected to keep a memory of the obstacle previously seen. All movements were performed at full speed. Results are presented in Figures 7 and 8.
Fig. 7. XY cut of the 3D map of the best particle at the end of the movement. The arrow labeled Real indicates the limit of the ground-truth wall location.

Fig. 8. Localization of the best particle at the end of the movement. The arrows represent the limits of the ground-truth wall and the obtained map.

During this experiment, the algorithm faces two near-collision situations. In both cases, the program managed to fulfill its objective and avoid the crash, despite the orders from the user to continue its trajectory towards the wall. Although both collisions were avoided, a major difference arises between them. The second time the algorithm was moving towards the wall, it managed to avoid the crash at a greater distance from the wall. After sensing the obstacle once, it is able to keep a localization with respect to it and is also able to prevent crashes more effectively. It is noticeable that, despite having drifted away from the real position, the algorithm managed to build a map (Figure 7) according to the belief of its position and to localize itself in it. Since our goal is to localize the robot relative to our map, we argue that the algorithm's performance is not directly compromised by the error between the real and the estimated position. The obtained results match our expectations and can be seen as a proof of concept towards a real-world implementation.
4. Conclusions

The main objective of this work was to develop an assistive tele-operation algorithm for quadcopters with the purpose of avoiding collisions with obstacles. The desired solution took the safety of the quadcopter as its primary objective, even in situations where the user's commands were contrary to it. Whenever confronted with an unknown area, the algorithm overrides the inputs in order to reduce uncertainty. The main objective was successfully achieved in simulation, using a FastSLAM approach for simultaneous localization and mapping combined with a danger classification methodology, in order to classify and act correspondingly in any situation. As for future work, we are currently working on a real-world implementation of the proposed method.
AUTHORS
João Mendes – Institute for Systems and Robotics, Instituto Superior Técnico, Portugal, e-mail: mendes.joao.p@gmail.com.
Rodrigo Ventura – Institute for Systems and Robotics, Instituto Superior Técnico, Portugal, e-mail: rodrigo.ventura@isr.ist.utl.pt.
Acknowledgements
This work was supported by the FCT projects [PEst-OE/EEI/LA0009/2011] and [PTDC/EIA-CCO/113257/2009].
References
[1] S. Bouabdallah, "Design and Control of an Indoor Micro Quadrotor". In: Proceedings of the International Conference on Robotics and Automation, 2004.
[2] A. Stentz, D. Fox, M. Montemerlo, "FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem with Unknown Data Association". In: Proceedings of the AAAI National Conference on Artificial Intelligence, 2003.
[3] A. Elfes, "Occupancy grids: A probabilistic framework for robot perception and navigation", Ph.D. dissertation, Carnegie Mellon University, 1989.
[4] C. Stachniss, D. Hähnel, W. Burgard, G. Grisetti, "On Actively Closing Loops in Grid-based FastSLAM", Advanced Robotics, vol. 19, 2005.
[5] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, "FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem". In: Proceedings of the AAAI National Conference on Artificial Intelligence, 2002, pp. 593–598.
[6] Z. He, R. Iyer, P. Chandler, "Vision-Based UAV Flight Control and Obstacle Avoidance". In: Proceedings of the American Control Conference, 2006, pp. 2166–2170.
[7] S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents), The MIT Press, 2005.
[8] N. Gordon, D. Salmond, A. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation", IEE Proceedings F (Radar and Signal Processing), vol. 140, 1993, pp. 107–113.