Image-based reconstructions of real-world scenes are helpful when the information needs to be accessed or shared from a different place. While working on site, however, those virtual scenes can be less helpful. Ad hoc note taking when visiting a construction site for the first time can be a problem: image-based reconstruction is often computationally intensive and therefore requires powerful hardware and time, both of which might not be available on site. And even if the scene has already been virtualized, the virtual interaction space might not be the ideal environment for collaborating with other people who are physically present. Most people who work on or visit a site are experts in working and planning with the real-world environment, and forcing them to use a virtual replacement might slow them down. In this project a prototype was created that allows image-based ad hoc note taking for planning while keeping the real-world interaction space intact. To achieve that while keeping the hardware requirements low, an off-the-shelf smartphone is combined with a laser pointer and a mirror that redirects the phone's rear-facing camera.
The device is used by directing the laser pointer at real-world objects and drawing lines and other kinds of annotations on them. The redirected camera records video and audio of the annotations according to the button presses on the user interface. After the recording is done, the annotated objects are reconstructed from the video as panoramic images with the annotated lines and markers drawn over them. For the construction of the panoramas an image alignment pipeline has been implemented that works with linear complexity. This makes it possible to align collections of hundreds of images and hence achieve a comparatively high frame rate, which allows a good reconstruction of the drawn lines. Reconstruction as in “Structure from Motion” is normally done via bundle adjustment, which has cubic complexity in the number of images. Such an approach requires far more computation time to reconstruct the same panoramic images than the linear approach of this project.
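The linear-time idea can be illustrated by chaining pairwise homographies: if each image is aligned only to its predecessor, mapping every image into the reference frame of the first one takes a single matrix composition per image. The following is a minimal sketch with hypothetical names (`Mat3`, `chainToReference`), not the project's actual code:

```cpp
#include <array>
#include <vector>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Compose two 3x3 homographies: the result applies b first, then a.
Mat3 compose(const Mat3& a, const Mat3& b) {
    Mat3 r{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Given pairwise homographies pairwise[i] mapping image i+1 into image i,
// accumulate each image's transform into the frame of image 0.
// One composition per image, i.e. linear in the number of images --
// no global (bundle-adjustment-style) optimization is performed.
std::vector<Mat3> chainToReference(const std::vector<Mat3>& pairwise) {
    Mat3 identity{};
    for (int i = 0; i < 3; ++i) identity[i][i] = 1.0;
    std::vector<Mat3> toRef{identity};
    for (const Mat3& h : pairwise)
        toRef.push_back(compose(toRef.back(), h));
    return toRef;
}
```

The trade-off against bundle adjustment is that alignment errors accumulate along the chain, which is acceptable here because consecutive video frames overlap heavily.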
For the image processing pipeline of this project the characteristic features are extracted with SIFT, the Scale-Invariant Feature Transform by Lowe (2004), which is also the basis for Photo Tourism by Snavely et al. as well as Microsoft Photosynth. Matches between image features are computed by nearest-neighbor evaluation of the feature descriptors, and homographies between images are estimated via RANSAC. Unlike in Lowe's original paper, the matches are filtered by cross-checking the best partner for each feature from both images. Further filtering is done by eliminating matches with too large a difference in feature rotation, eliminating matches that fall outside the overlapping area of the joined image (after step 2), and eliminating matches that deviate too much from the average direction of all matches (after step 3).
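The cross-check filter described above keeps a match only if the two features are mutual nearest neighbors in both directions. A minimal sketch with brute-force nearest-neighbor search (names such as `crossCheckMatches` are assumptions, and real SIFT descriptors would have 128 dimensions):

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Descriptor = std::vector<float>;

// Squared Euclidean distance between two descriptors of equal length.
double dist2(const Descriptor& a, const Descriptor& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Index of the nearest neighbor of q in set (brute force).
std::size_t nearest(const Descriptor& q, const std::vector<Descriptor>& set) {
    std::size_t best = 0;
    double bestD = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < set.size(); ++i) {
        double d = dist2(q, set[i]);
        if (d < bestD) { bestD = d; best = i; }
    }
    return best;
}

// Cross-check: keep (i, j) only if i's nearest neighbor in B is j
// AND j's nearest neighbor in A is i.
std::vector<std::pair<std::size_t, std::size_t>>
crossCheckMatches(const std::vector<Descriptor>& A,
                  const std::vector<Descriptor>& B) {
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < A.size(); ++i) {
        std::size_t j = nearest(A[i], B);
        if (nearest(B[j], A) == i)
            matches.emplace_back(i, j);
    }
    return matches;
}
```

This replaces Lowe's ratio test (best vs. second-best distance) with a symmetric criterion; OpenCV offers the same idea via the `crossCheck` flag of its brute-force matcher.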
The correlation of two images is used to stitch together panoramic images that correspond to the lines drawn with the laser pointer. This results in a collection of separate panoramic strips. These strips are then correlated and stitched together as well. Using the specific information about the use case allows for linear-time reconstruction and aggregation of all images instead of a more time-consuming global adjustment. For correlating the separate strips, all matched features of each strip are compressed by replacing them with the one feature that represents its cluster best, determined by comparison with the average feature.
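Picking a representative per cluster can be sketched as follows: compute the mean descriptor of the cluster and keep the actual member closest to it (using a real member rather than the mean itself keeps the representative a valid descriptor). Function names here are illustrative assumptions:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Desc = std::vector<float>;

// Component-wise average of all descriptors in a non-empty cluster.
Desc meanDescriptor(const std::vector<Desc>& cluster) {
    Desc mean(cluster.front().size(), 0.0f);
    for (const Desc& d : cluster)
        for (std::size_t i = 0; i < d.size(); ++i)
            mean[i] += d[i];
    for (float& v : mean)
        v /= static_cast<float>(cluster.size());
    return mean;
}

// Index of the member closest to the mean -- the cluster representative.
std::size_t representative(const std::vector<Desc>& cluster) {
    Desc mean = meanDescriptor(cluster);
    std::size_t best = 0;
    double bestD = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < cluster.size(); ++i) {
        double s = 0.0;
        for (std::size_t k = 0; k < mean.size(); ++k) {
            double d = cluster[i][k] - mean[k];
            s += d * d;
        }
        if (s < bestD) { bestD = s; best = i; }
    }
    return best;
}
```

Matching strips then only has to compare one descriptor per cluster instead of every feature of every frame.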
The reconstructed images are then used as background for the annotations recorded by the user. The scenes can then be reviewed either as static images with annotations drawn over them, or as videos including the audio from the time when the annotations were taken. Evaluating audio commands could add more flexibility to the annotation content. However, verbal annotations can have a negative impact on verbal conversation with other people and break the flow of communication.
As an alternative to audio commands, the touch interface of the mobile application can be used to create different annotations. The audio track is still recorded for review and to attach acoustic labels or descriptions to annotations.
The mobile application is implemented for Android devices and uses the MediaRecorder API. In parallel to recording the video and audio data, a log of all annotation events is created. After all files have been transferred to a PC, the log is used to split up the recording. A C++ application based on the ffmpeg library reads the frame rate and extracts single images as well as audio slices according to the log file. The mp3gain library is used to normalize the audio volume. A description of the whole scene is stored in a JSON file. A second C++ application stitches the images according to the scene description, using the OpenCV image processing library. The full reconstructed scenes can then be accessed via a web interface where individual annotations can be picked and reviewed. A test scene can be found here.
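The actual schema of the scene description is not shown above; purely as an illustration, a JSON file along these lines could tie the panoramic strips to the recorded annotation time ranges (all field names here are assumptions, not the project's real format):

```json
{
  "scene": "site-visit-01",
  "video": "recording.mp4",
  "strips": [
    {
      "id": 0,
      "frames": { "start": 120, "end": 480 },
      "annotations": [
        { "type": "line", "audio": "slice-000.mp3", "start_ms": 4000, "end_ms": 16000 }
      ]
    }
  ]
}
```

A machine-readable scene file like this lets the stitching application and the web interface share one source of truth about which frames and audio slices belong to each annotation.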
This work was part of a cooperation between RheinMain University of Applied Sciences and Darmstadt University of Applied Sciences, funded by the program “Forschung für die Praxis” of the Hessian Ministry for Science and Art and supported by Ove Arup & Partners, Hessisches Baumanagement, Fraunhofer IGD, Goethe Universität Frankfurt a.M., and University of London.