Abstract
The construction industry’s alarming accident and casualty figures pose urgent challenges to industry practitioners and researchers. A shortage of management personnel hinders strict on-site monitoring and inspection. With the development of deep learning and artificial intelligence (AI), many intelligent agents based on surveillance cameras have been deployed to replace human labor in automated construction management. Scene understanding and reasoning, also known as vision-based reasoning (or visual reasoning), is an essential part of camera-based automated construction management. However, current work on scene understanding and reasoning in construction still has many limitations, such as single-modal understanding/reasoning, a lack of research data, no spatial perception, and low efficiency and accuracy. Scene understanding and reasoning can be implemented in two-step or one-step form. The two-step method relies on intermediate results (e.g., image captions, object bounding boxes, semantic labels) generated by scene understanding to connect the subsequent reasoning, while the one-step method realizes the process from understanding to reasoning without generating intermediate results. This study aims to extensively explore multiple implementations of scene understanding and reasoning in construction based on computer vision (CV) and natural language processing (NLP) to alleviate, improve, or solve the mentioned issues. A systematic survey of relevant technologies and applications is first conducted to provide a complete picture of development in this area. This survey focuses on NLP applications in construction, since there are many review articles on CV but few on NLP in this area. To this end, 91 NLP-related research articles from 2000 to 2020 are collected for various scientometric analyses using the VOSviewer and CiteSpace tools.
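The two-step versus one-step distinction described above can be illustrated with a minimal sketch. The model and rule functions here are hypothetical stand-ins (hard-coded returns), not the thesis's implementations:

```python
# Two-step: scene understanding produces an intermediate result (here, a
# caption), and a separate rule module reasons over that intermediate text.
def caption_model(image):
    # Stand-in for an image-captioning model such as V-BERT.
    return "a worker without a helmet stands near an excavator"

def two_step_reasoning(image):
    caption = caption_model(image)        # step 1: understanding
    return "without a helmet" in caption  # step 2: reasoning over the caption

# One-step: a cross-modal model (e.g., a VQA model) maps the image and a
# question directly to an answer, with no intermediate caption or labels.
def vqa_model(image, question):
    # Stand-in for a fine-tuned VQA model such as ViLT.
    return "no"

def one_step_reasoning(image):
    return vqa_model(image, "Is the worker wearing a helmet?") == "no"

image = object()  # placeholder for real pixel data
print(two_step_reasoning(image), one_step_reasoning(image))
```

The practical difference is that the two-step path exposes an inspectable intermediate artifact, while the one-step path trades that transparency for a shorter pipeline.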
These articles are then taxonomically summarized in terms of datasets/data sources, technologies/tools, and applications and progress. The current challenges, possible solutions, and future research trends of NLP applications are discussed at the end of the investigation. The review indicates that data isolation is currently a severe problem for NLP applications in construction, and that cross-modal, interdisciplinary NLP applications are the future trend.
Second, a new image captioning model, vision-based bidirectional encoder representations from Transformers (V-BERT), is proposed for more accurate scene understanding, given the few studies on image caption-based construction scene understanding, especially in the Chinese context. A new Chinese image caption dataset, construction images with Chinese captions (CICC), is constructed to train and validate the V-BERT model. Experimental results show that V-BERT achieves state-of-the-art (SOTA) performance in construction, with 69% and 171% relative performance improvements in sentence and key-element generation, respectively.
In addition to image captions, the outputs of many CV tasks can also serve as intermediate results connecting understanding and reasoning. However, current CV studies in construction are based almost entirely on two-dimensional (2D) data (both raw data and labels are 2D), resulting in a lack of three-dimensional (3D) spatial perception in field applications. Therefore, the third work of this study constructs a virtual construction vehicles and workers dataset with 3D annotations (VCVW-3D), covering 15 indoor and outdoor scenarios and ten types of construction vehicles and workers. The VCVW-3D dataset is characterized by its multi-scene, multi-category, multi-randomness, multi-viewpoint, multi-annotation, and binocular-vision features. It is expected to promote the development of 3D scene understanding in construction by reducing the costs of data acquisition, prototype development, and exploration of space-aware applications.
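A single 3D annotation in a dataset of this kind might be organized along the following lines. The field names and layout below are illustrative assumptions for the sake of concreteness, not the actual VCVW-3D schema:

```python
from dataclasses import dataclass

@dataclass
class Annotation3D:
    """Illustrative 3D annotation for one object in a stereo image pair."""
    category: str      # e.g., "worker" or "excavator"
    bbox_2d: tuple     # (x_min, y_min, x_max, y_max) in pixels
    center_3d: tuple   # (x, y, z) in meters, camera coordinates
    dimensions: tuple  # (length, width, height) in meters
    yaw: float         # rotation around the vertical axis, in radians
    viewpoint: str = "left"  # which camera of the binocular pair

# Example record: a standing worker roughly 14.5 m from the camera.
ann = Annotation3D(
    category="worker",
    bbox_2d=(412, 230, 478, 390),
    center_3d=(3.2, 0.1, 14.5),
    dimensions=(0.6, 0.5, 1.75),
    yaw=1.57,
)
print(ann.category, ann.center_3d[2])
```

Pairing 2D boxes with 3D centers and dimensions in one record is what allows a monocular detector trained on such data to recover metric depth cues.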
Based on the 3D bounding box annotations of VCVW-3D, the perception and reasoning of proximity between workers and heavy construction vehicles are further investigated. Existing proximity monitoring methods are either too laborious and costly to apply extensively or lack the spatial perception needed for accurate distance estimation. Thus, the fourth work proposes a novel framework for proximity monitoring using only an ordinary 2D camera, integrating a monocular 3D object detection model and a post-reasoning module to identify four proximity categories: Dangerous, Potentially Dangerous, Concerned, and Safe. Experiments on virtual data show that the implemented system runs in real time, is independent of the camera carrier, and achieves an F1-score of roughly 0.8 for proximity category recognition within a range of 50 meters. This work preliminarily demonstrates the feasibility of proximity monitoring with a single 2D surveillance camera, providing a promising and affordable means of early warning against human-machine collisions on construction sites.
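The post-reasoning step that maps an estimated worker-vehicle distance to one of the four categories could look like the following sketch. The threshold values are illustrative assumptions, since the abstract does not specify them:

```python
def proximity_category(distance_m: float,
                       dangerous: float = 3.0,
                       potential: float = 7.0,
                       concerned: float = 15.0) -> str:
    """Map an estimated 3D worker-vehicle distance (meters) to a category.

    The thresholds are hypothetical; a real system would calibrate them
    to vehicle type, speed, and site safety policy.
    """
    if distance_m <= dangerous:
        return "Dangerous"
    if distance_m <= potential:
        return "Potentially Dangerous"
    if distance_m <= concerned:
        return "Concerned"
    return "Safe"

print(proximity_category(2.5))   # a worker very close to the vehicle
print(proximity_category(40.0))  # well outside the cautionary range
```

The distances themselves would come from the 3D centers predicted by the monocular detector for each worker-vehicle pair in the frame.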
Finally, a novel one-step framework for scene understanding and reasoning based on visual question answering (VQA) is proposed. This one-step framework is applied to identify workers’ unsafe construction behaviors, because existing automated safety compliance checking methods are inefficient, relying on either two-step methods (i.e., reasoning over intermediate results) or heavy cross-modal models. A “rule-question” transformation and annotation system is formulated to turn the detection of workers’ unsafe behaviors into a VQA task. The vision-and-language Transformer (ViLT) is adopted as the VQA model and fine-tuned on a newly created VQA dataset with 16 safety rules and 2,386 construction images. Whether an image contains hazardous construction behaviors is judged from the output answers of the VQA model. Experiments show that the developed VQA model achieves an average recall of 0.81 for workers’ unsafe behaviors at a reasoning speed of 92 questions per second.
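The “rule-question” transformation can be sketched as a table of safety rules, each paired with a yes/no question and the answer that flags a violation. The two example rules below are illustrative, not drawn from the thesis's 16-rule set, and the VQA model is replaced by a stub:

```python
# Each safety rule is rewritten as a yes/no question for the VQA model;
# "violation_answer" is the model answer that indicates unsafe behavior.
RULES = [
    {"rule": "Workers must wear safety helmets.",
     "question": "Is every worker wearing a safety helmet?",
     "violation_answer": "no"},
    {"rule": "Workers must not stand under suspended loads.",
     "question": "Is any worker standing under a suspended load?",
     "violation_answer": "yes"},
]

def check_image(image, vqa_model):
    """Return the rules the image appears to violate."""
    violations = []
    for r in RULES:
        answer = vqa_model(image, r["question"])
        if answer == r["violation_answer"]:
            violations.append(r["rule"])
    return violations

# A stub VQA model for demonstration: it always answers "no".
stub = lambda image, question: "no"
print(check_image(object(), stub))
```

Phrasing each rule as a question lets one fine-tuned VQA model cover the whole rule set, instead of training a separate detector per rule.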
| Date of Award | 4 Sept 2024 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Xiaowei LUO (Supervisor) |
Keywords
- Automated construction management
- Computer vision
- Natural language processing
- Scene understanding and reasoning
- Visual reasoning
- Image caption
- Three-dimensional object detection
- Proximity monitoring
- Visual question answering