Computer vision methods for tool guidance in a finger-mounted device for the blind
Yuxuan Hu, Rene R. Canady, B.S., Roberta Klatzky, Ph.D., and George Stetten, M.D., Ph.D.
The Visualization and Image Analysis (VIA) Laboratory, Department of Bioengineering
Yuxuan Hu Yuxuan Hu is a Bioengineering senior at the University of Pittsburgh. He will be a graduate student next year at ETH Zurich, studying surgical vision and imaging.
Rene R. Canady, B.S. Rene R. Canady received a BS in bioengineering in 2020 from the University of Pittsburgh and is currently pursuing a Ph.D. in Sociology at Washington University, St. Louis. Her research interests include engineering ethnography and racial controversies in health.
Roberta Klatzky, Ph.D. Roberta Klatzky is a professor of Psychology and Human-Computer Interaction at Carnegie Mellon. She enjoys combining basic research with applications.
George Stetten, M.D., Ph.D. George Stetten is Professor of Bioengineering at the University of Pittsburgh. He directs the Visualization and Image Analysis (VIA) Laboratory and the Music Engineering Laboratory (MEL) and is a fellow in the American Institute for Medical and Biological Engineering.
Significance Statement
Tool handling and close-up operation are challenges for the visually impaired that have not been well addressed by assistive devices. We have developed novel and computationally efficient computer vision methods for real-time tool detection and target motion classification, to improve on "FingerSight," a finger-mounted haptic device designed to help the visually impaired complete daily tasks.
Category: Methods
Keywords: Assistive Technology, Haptics, Computer Vision, Image Analysis
Abstract
People with visual impairment often have difficulty performing high-precision tasks such as interacting with a target using a tool. We propose a device for the visually impaired that provides accurate localization of targets via vibratory feedback, making tool-handling tasks easier. The device is an adaptation of our existing system, "FingerSight," a finger-mounted device consisting of a camera and four vibrators (tactors) that respond to analysis of images from the camera. We report here on a new design for the hardware and optimization of real-time algorithms for tool detection and motion classification. These include the determination that a tactor vibration duration of 90 ms yielded a minimum error rate (5%) in tactor identification, and that the best parameters for the tool recognition algorithm were threshold = 13 and kernel size = 11, yielding an average tracking error of 23 pixels in a 640 x 480 pixel camera frame. Anecdotal results obtained from a single healthy blindfolded subject show the device's functionality as a whole and its potential for guiding the visually impaired in manipulating tools in real-life scenarios.
1. Introduction
In 2015, approximately 3.22 million people in the United States were visually impaired, of whom 1.02 million were blind. By 2050, the number of people afflicted by visual impairment is projected to double [1]. This increasing prevalence of visual impairment has driven scientists and engineers to develop various assistive technologies, many of which utilize computer vision and haptics. Significant progress has been made in the area of user mobility in pedestrian environments. For example, a system using tactors attached to the torso has been developed to localize the user with respect to the surroundings and guide travel while avoiding obstacles [2]. Another study focuses on detecting aerial obstacles with a stereo vision wearable device [3]. However, few solutions have addressed the commonplace problem of finding targets in peripersonal space, i.e., the space within reach where objects can be grasped and manipulated. Our laboratory previously developed a system with hand-mounted binocular cameras and five vibrators to help the user locate nearby targets in 3D space using a depth map generated with computer vision methods [4]. Our present system, "FingerSight," is a wearable device originally intended to help the visually impaired navigate in the environment and locate targets in peripersonal space. The device, mounted on the finger, contains a camera and a set of vibrators (tactors) that are activated based on computer vision analysis of the real-time camera image. Previous research on this technology by Satpute et al., in our laboratory, demonstrated its effectiveness for guiding blindfolded participants to move their hand to reach an LED target located in front of their body [5].
Based on the working prototype by Satpute et al., our present research focuses on incorporating the use of hand-held tools into the basic FingerSight framework. Accurate localization and feedback are particularly needed when targets are to be optimally contacted using a tool. Two major challenges are real-time recognition of a variety of tools and immediate handling of the contact between tool and target. In this paper, we present a new version of FingerSight with improved functionality and novel computer vision methods to address these two challenges. We developed an experiment to determine the optimal vibratory signal duration for discriminating which of the four tactors was activated. For our initial algorithm to detect hand-held tools used in everyday situations, we chose a plastic fork, since dining is a common scenario in which the target may not ideally be manipulated by hand. We conducted experiments to optimize two key parameters of our computer vision algorithm: intensity threshold and kernel size.
2. Methods
2.1 Hardware Configuration and Error Rate Computation
The hardware design of our device aims to optimize tactile feedback delivery for guidance of the hand in 3D space, while providing portability and comfort over a range of finger sizes without impeding normal use of the hand. It combines a miniature video camera and four tactors embedded into a soft plastic ring attached by wire to a portable Raspberry Pi 4 computer. The ring has four expandable joints within a single 3D-printed unit, built on an Ultimaker 3 FDM 3D printer using NinjaFlex filament, made from formulated thermoplastic polyurethane. Elongation of the filament by up to 660% in the expandable joints provides comfort over a range of finger sizes, while allowing flexible wires to be securely routed. The polyurethane composition provides adequate vibration reduction, reducing cross-transmission of vibratory signals between the tactors. The four cylindrical tactors are attached within the ring to contact the user's index finger, providing guidance in four directions in a finger-centric coordinate system (Fig. 1).
Figure 1. Left: The ring with tactors wired in. Tactors are marked with numbers, each of which represents a unique direction in which a movement is expected after vibration (consistent with Table 2). Right: The device worn on the hand with the camera inserted, guiding a hand-held spoon toward a plastic strawberry.
We performed an experiment on how well users can discriminate between the four directions when individual tactors are activated, conducted with a healthy, blindfolded, 21-year-old right-handed male. It was designed not to prove statistical significance, but rather to produce a viable system for further experimentation. Three trials were conducted with different durations of vibratory signals (60 ms, 90 ms, and 120 ms), delivered by four identical vibrators with an operating voltage of 3.7 V and a frequency of 205 Hz. The participant was initially trained by activating the tactors in sequence with the correct answer viewed on a screen. Then, 40 signals (10 in each direction) were given in random order. For each signal, the participant was instructed to press one of the four arrow keys to record the detected location, with no time limit. The results were pooled over trials for each tactor, and a confusion matrix and error rate were computed.
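To make the scoring procedure concrete, the Python sketch below shows one way the 40 randomized stimuli could be delivered and tallied into a confusion matrix and error rate. The helper functions activate_tactor and read_arrow_key are hypothetical stand-ins for the device's actuator control and keyboard input, not a transcription of our implementation.

```python
import random
import numpy as np

N_TACTORS = 4           # up, right, down, left
TRIALS_PER_TACTOR = 10  # 40 stimuli per signal duration

def run_discrimination_trial(activate_tactor, read_arrow_key, duration_ms=90):
    """Deliver 40 vibrations in random order and tally responses.

    activate_tactor(i, duration_ms) and read_arrow_key() are hypothetical
    stand-ins for the actuator and keyboard I/O on the Raspberry Pi.
    """
    stimuli = [i for i in range(N_TACTORS) for _ in range(TRIALS_PER_TACTOR)]
    random.shuffle(stimuli)

    confusion = np.zeros((N_TACTORS, N_TACTORS), dtype=int)
    for tactor in stimuli:
        activate_tactor(tactor, duration_ms)   # deliver the vibratory cue
        response = read_arrow_key()            # 0-3, no time limit
        confusion[tactor, response] += 1

    # Off-diagonal entries are erroneous indications.
    error_rate = 1.0 - np.trace(confusion) / confusion.sum()
    return confusion, error_rate
```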
2.2 Computer Vision Algorithm
2.2.1 Tool Detection
We assume that the tool is held in a stationary grip without motion relative to the hand, at least over the short term. Thus, the tool will ideally appear stationary in the view of the camera, while the background will constantly change as the user moves their hand around. To detect the stationary tool, we calculate the mean intensity at every pixel in the camera's view over the latest 10 video frames and construct a "mean frame" M(x, y). Assuming the tool is stationary in the camera's view, the intensity of pixels i(x, y) within the tool for a given frame should fall within some threshold h of the mean intensity at that pixel, while pixels representing the background can be expected to exceed that threshold. Thus a binary mask b(x, y) can be generated by applying the threshold to segment the tool from the background:
b(x, y) = 1 if |i(x, y) − M(x, y)| ≤ h, and b(x, y) = 0 otherwise.    (1)
To eliminate salt-and-pepper noise in the segmentation, we apply the mathematical morphology operations of erosion and dilation with kernel size k, and then sort the resulting contours by area, taking the largest contour as the tool (Fig. 2).
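The sketch below illustrates this segmentation step using OpenCV, assuming a rolling buffer of the 10 most recent grayscale frames. The class structure, buffer handling, and default parameter values (h = 13, k = 11, matching the optimal settings reported later) are illustrative choices rather than a transcription of our implementation.

```python
from collections import deque
import cv2
import numpy as np

class ToolDetector:
    """Segment the (approximately stationary) tool from a moving background."""

    def __init__(self, h=13, k=11, n_frames=10):
        self.h = h                              # intensity threshold of Eq. 1
        self.k = k                              # erosion/dilation kernel size
        self.frames = deque(maxlen=n_frames)    # rolling buffer of gray frames

    def detect(self, frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        self.frames.append(gray)
        if len(self.frames) < self.frames.maxlen:
            return None                         # not enough history yet

        mean_frame = np.stack(self.frames).mean(axis=0)       # M(x, y)
        diff = np.abs(gray - mean_frame)
        mask = (diff <= self.h).astype(np.uint8) * 255        # b(x, y), tool = white

        kernel = np.ones((self.k, self.k), np.uint8)
        mask = cv2.erode(mask, kernel)                        # remove speckle noise
        mask = cv2.dilate(mask, kernel)                       # restore tool area

        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        return max(contours, key=cv2.contourArea)             # largest contour = tool
```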
Figure 2. Demonstration of the tool detection algorithm. The ten most recent frames (left) are used to calculate the intermediate "mean frame" (top middle), which is subtracted from the current frame (bottom middle); the absolute difference is thresholded to yield a binary image (right) in which positive pixels (white) represent the tool.
Since the tip of the tool will always be the location farthest from the bottom-right corner of the image (if the user is right-handed), we can identify the tip location within the largest contour by sorting the pixel coordinates. By comparing the location of the tool tip with the target's center, the algorithm produces a direction in which the hand should move to reach the target and gives instruction accordingly through the tactors. We had previously studied various strategies for activating the tactors to indicate direction. As reported in Satpute et al.'s study, correcting individual axes leads to better results than simultaneous signals for finger-mounted tactor guidance [5], and we adopted this approach in our design; e.g., for the vector (100, -200), only the bottom tactor will activate, rather than both the right and bottom tactors at the same time.
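A minimal sketch of the tip localization and single-axis tactor selection described above is given below. The 640 x 480 frame dimensions match the camera frame reported here, but the function names and direction labels are assumptions for illustration only.

```python
import numpy as np

def find_tool_tip(contour, frame_w=640, frame_h=480):
    """Tool tip = contour point farthest from the bottom-right corner
    (assumes a right-handed user, as described in the text)."""
    pts = contour.reshape(-1, 2)                     # (x, y) pixel coordinates
    corner = np.array([frame_w - 1, frame_h - 1])
    dists = np.linalg.norm(pts - corner, axis=1)
    return pts[np.argmax(dists)]                     # (x, y) of the tip

def choose_tactor(tip, target_center):
    """Correct one axis at a time: only the tactor for the dominant axis fires.

    For a displacement such as (100, -200), only the vertical-axis tactor is
    activated, never both axes simultaneously. The direction labels returned
    here are illustrative.
    """
    dx, dy = np.asarray(target_center) - np.asarray(tip)
    if abs(dx) >= abs(dy):
        return 'right' if dx > 0 else 'left'
    return 'down' if dy > 0 else 'up'
```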
Two parameters in the algorithm that directly affect the efficacy of tool detection are the binary mask threshold h and the kernel size k for erosion and dilation, and the direct indicator of tracking accuracy is the distance S between the detected tip location and its actual location. To simulate different holding postures, we set the camera at two different distances from the tip of the tool (7 cm and 10.5 cm) and recorded 10-second videos of the participant waving the tool around in spontaneous, smooth motion. The videos were recorded against a printed checkerboard to ensure a consistently changing background. We tested 9 settings with 3 empirically chosen values each for h and k: h = 1, 7, 13 and k = 11, 17, 23. Since the tool was held firmly in both videos, we manually located one tip location for each video, then calculated S in every frame. By comparing the mean and standard deviation of S in each of the 9 settings, we could determine the optimal algorithm parameters.
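The parameter sweep can be summarized by the following sketch, where detect_tip is a hypothetical wrapper that runs the detector of the previous section with a given (h, k) on a sampled frame and returns the detected tip, and true_tip is the manually marked tip location for that video.

```python
import itertools
import numpy as np

def evaluate_settings(frames, true_tip, detect_tip,
                      thresholds=(1, 7, 13), kernel_sizes=(11, 17, 23)):
    """Grid-search h and k; report mean/std of tip error S and detection count N.

    detect_tip(frame, h, k) is a hypothetical wrapper returning the detected
    tip (x, y) or None; true_tip is the manually marked tip location.
    """
    results = {}
    for h, k in itertools.product(thresholds, kernel_sizes):
        errors = []
        for frame in frames:                   # e.g. 5 sampled frames/s over 10 s
            tip = detect_tip(frame, h, k)
            if tip is not None:
                errors.append(np.linalg.norm(np.asarray(tip) - np.asarray(true_tip)))
        results[(h, k)] = {
            'N': len(errors),                  # frames with successful detection
            'mean_S': float(np.mean(errors)) if errors else None,
            'std_S': float(np.std(errors)) if errors else None,
        }
    return results
```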
2.2.2 Immediate Motion Classification
When the tool is about to reach the target, the user may be inattentive and keep moving in the last instructed direction at relatively high speed. If the user is unsuccessful at acquiring the target, the motion of the target relative to the tool falls into one of four categories: (1) the tool passes the target without making contact; (2) the tool bumps the target straight forward; (3) and (4) the target rolls off to one or the other side of the tool. To classify motion as fast as the camera's frame rate allows, we reduce the complexity of the algorithm by using the tool tip location computed in the previous section and dividing the camera's field of view into four quadrants around the tool tip. A thresholded color value c(x, y) is calculated for each sample based on its intensity and thresholds applicable to the target's color. The sum Ai of c(x, y) over each of the four quadrants Qi (i = 1, 2, 3, 4; see Fig. 3) is used to generate a location vector for the target with respect to the tool tip:
Ai = Σ(x, y)∈Qi c(x, y),   i = 1, 2, 3, 4.    (2)
At this stage, we use a threshold on the size of the object in the frame to determine whether the object is close enough to the tool for motion classification to commence. Once the running program determines imminent contact of tool and object, the above operation is performed on every two consecutive frames (thus operating at the camera's frame rate), and by comparing the two location vectors obtained, we acquire an immediate motion vector for the target. According to the calculated motion vector, the appropriate tactor sequence is delivered to instruct corrective maneuvers according to Table I.
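A sketch of the quadrant sums and the frame-to-frame motion vector follows. The HSV color thresholds, the imbalance formula used to form the location vector, and the simple sign test mapping onto the four categories of Table I are illustrative assumptions rather than our exact implementation.

```python
import cv2
import numpy as np

def target_location_vector(frame_bgr, tip, lower_hsv, upper_hsv):
    """Quadrant sums A_i of the color-thresholded mask c(x, y) around the tool
    tip, combined into a target location vector relative to the tip."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    c = cv2.inRange(hsv, lower_hsv, upper_hsv) // 255     # c(x, y) in {0, 1}
    tx, ty = int(tip[0]), int(tip[1])

    A1 = c[:ty, tx:].sum()    # upper-right quadrant
    A2 = c[:ty, :tx].sum()    # upper-left
    A3 = c[ty:, :tx].sum()    # lower-left
    A4 = c[ty:, tx:].sum()    # lower-right

    total = A1 + A2 + A3 + A4
    if total == 0:
        return None, 0
    # Horizontal/vertical imbalance of the quadrant sums -> location vector.
    # `total` can also be compared against a size threshold to judge whether
    # contact is imminent and classification should commence.
    vec = np.array([(A1 + A4) - (A2 + A3), (A3 + A4) - (A1 + A2)]) / total
    return vec, total

def classify_motion(vec_prev, vec_curr, last_direction):
    """Rough mapping of the frame-to-frame motion vector onto Table I's
    categories; last_direction is the last instructed direction as a 2D vector."""
    motion = vec_curr - vec_prev
    if abs(motion[0]) > abs(motion[1]):
        return 'roll_right' if motion[0] > 0 else 'roll_left'
    return 'bumped_forward' if np.dot(motion, last_direction) > 0 else 'missed'
```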
Figure 3. A demonstration of our immediate motion classification method. Here the target (foam cherry) rolls off to the left after contact with the tool (spoon). The appropriate corrective maneuver is to twist the hand to the left immediately.
Motion Category | Corrective Maneuver | Vibrator Sequence
Miss without Contact | Move hand backward | Vibrator opposite the last suggested direction: vibrate twice
Bumped Forward | Move hand forward | Vibrator in the last suggested direction: vibrate twice
Roll off to the Right | Twist hand to the right until contact with the target is felt again | Clockwise sequence
Roll off to the Left | Twist hand to the left until contact with the target is felt again | Counter-clockwise sequence
Table I. Target motion categories, corrective maneuvers, and corresponding tactor sequences.
3. Results
3.1 Discrimination between 4 Tactors
Three runs of 40 tests each were completed on the same trained participant, one for each signal duration, and confusion matrices were constructed, as shown in Tables 2, 3, and 4. A signal duration of 90 ms yielded the minimum overall error rate of 5% (2 erroneous indications out of 40 samples).
Table 2. Confusion Matrix for Signal Duration 60 ms. The overall error rate is 27.5%.
Table 3. Confusion Matrix for Signal Duration 90 ms. The overall error rate is 5% (lowest among the three sets).
Table 4. Confusion Matrix for Signal Duration 120 ms. The overall error rate is 10%.
3.2 Algorithm Parameters for Tool Detection
We completed 18 trials with 9 parameter combinations and 2 holding postures. Error S was calculated as the Pythagorean distance in pixel coordinates between the actual location of the tool tip and its detected location. Means and standard deviations of S over 10-second intervals (5 iterations per second, 50 samples in total) were calculated, and the number of frames with successful tool detection, N, was recorded. Fig. 4 shows that the combination of binary threshold value 13 and kernel size 11 yielded the lowest average error distance (68.4 and 23.4 pixels for the two posture estimations), as well as detection of the tool in all 50 samples under both posture estimations.
Figure 4. Average error distance and its standard deviation over 50 samples are shown for each of the 2 posture estimations (tool tip to camera distance) and 9 parameter settings. The 6 trials with a threshold value of 1 all yielded no valid detection (N = 0) and thus are not displayed in the charts. Threshold value 13 and kernel size 11 yielded the lowest average error while succeeding in all 50 detection attempts in both posture estimations.
4. Discussion
The experiments presented here demonstrate the promise of a finger-mounted, computer-vision-based assistive device for providing guidance to the visually impaired. The tactor localization experiment provides a means of determining the optimal signal duration for transmitting tactile information to individual users with satisfactory accuracy, although the hardware's overall reliability still needs to be proven through experiments with multiple participants. The tool detection experiments yielded error distances of 68.4 and 23.4 pixels in a 640 x 480 camera frame given the optimal setting of the algorithm parameters. Such performance is expected to be independent of utensil choice or holding posture, as long as the tip of the tool is present in the camera's view. This component, when applied alone, showed vulnerability when exposed to a completely homogeneous background, since a changing background is a premise of its basic function, although such errors could be reduced by clustering to eliminate momentary errors in tool tip location. We selected plastic utensils for the experiments because shifting reflections on silverware under bright lighting can be detected as motion, hindering the tool detection process.
Although one subject is insufficient to yield statistical significance, our tactor discrimination experiment yielded an error rate of 5% with a signal duration of 90 ms, whereas Satpute et al. reported 11.2%. Anecdotal results, obtained by attempting to scoop up a plastic cherry with a spoon while blindfolded, show success in reaching the target and in responding to corrective instructions immediately after making contact. At its current stage, our system cannot determine the exact location of objects in 3D space, since localization and guidance are computed from 2D information obtained by a single camera. Depth information could be obtained by using stereo cameras, as in the "PalmSight" system [4], but this would be difficult to fit onto a single finger. More extensive experiments on the reliability of the motion classification algorithm and the FingerSight device in general are yet to be conducted, as a result of limited participant recruitment during the Covid-19 pandemic.
5. Conclusion
We have made significant progress toward adding new functions and hardware designs to the previous FingerSight paradigm of guiding manual interaction and navigation for the visually impaired in real-life scenarios. Experimental results were used to determine the optimal device configuration and algorithm parameters for tool detection and tactor localization by comparing quantifiable errors. Future research will include designing formal experiments to demonstrate the efficacy of our current prototype and developing more sophisticated algorithms for target detection and localization in 3D space.
6. Acknowledgements
Funding was provided by the Center for Medical Innovation and the Swanson School of Engineering Summer Undergraduate Research Internship (SURI) program at the University of Pittsburgh.
7. References
[1] R. Varma et al., "Visual impairment and blindness in adults in the United States," JAMA Ophthalmology, vol. 134, no. 7, pp. 802-809, 2016.
[2] Y. H. Lee and G. Medioni, "RGB-D camera based wearable navigation system for the visually impaired," Computer Vision and Image Understanding, vol. 149, pp. 3-20, 2016.
[3] J. M. Saez Martinez and F. Escolano Ruiz, "Stereo-based aerial obstacle detection for the visually impaired," Workshop on Computer Vision Applications for the Visually Impaired (J. Coughlan and R. Manduchi, Eds.), Marseille, France, Oct. 2008, inria-00325455.
[4] Z. Yu, S. Horvath, A. Delazio, J. Wang, R. Almasi, R. Klatzky, J. Galeotti, and G. Stetten, "PalmSight: An assistive technology helping the blind to locate and grasp objects," Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-RI-TR-16-59, Dec. 2016.
[5] S. A. Satpute, J. R. Canady, R. L. Klatzky, and G. D. Stetten, "FingerSight: A vibrotactile wearable ring for assistance with locating and reaching objects in peripersonal space," IEEE Transactions on Haptics, vol. 13, no. 2, pp. 325-333, Apr.-Jun. 2020, doi: 10.1109/TOH.2019.2945561.