A New Approach to Assessing Perceived Walkability: Combining Street View Imagery with a Multimodal Contrastive Learning Model

Walkability is becoming increasingly important in urban planning, public health, and environmental protection. Traditional assessment tools such as streetscape images and semantic segmentation focus on objective factors, while questionnaires, the main tool for assessing perceived walkability, are limited by cost and scale. This study introduces a new method that uses a multimodal contrastive learning model, CLIP, to assess perceived walkability by analysing both tangible and subjective factors such as safety and attractiveness. The method compares perceived with physical walkability by scoring street view images with a customized scale. Initial results indicate that CLIP can identify pedestrian-friendly streetscapes that might score low on physical metrics. While its accuracy needs further evaluation, CLIP offers a cost-effective alternative that does not require extensive labelled datasets. This method can be combined with objective pedestrian assessment methods to serve as reference information for industries such as real estate, transportation planning, and tourism.


INTRODUCTION
As global urbanization accelerates, walkability has moved beyond a purely transportation function to become a function of community connectivity, public health, and environmental protection. More cities are beginning to promote walking in practical ways, pushing for pedestrian-friendly neighbourhood environments.
Research on walkability has focused on objective environmental factors of walking. Early studies identified mesoscale factors such as residential density and land use as crucial determinants of walking behaviour [7]. Technological advances such as streetscape imagery and semantic segmentation have enabled the measurement of street-scale peripatetic features [8,11]. However, walking behaviour is also influenced by subjective walking intentions. For example, some narrow neighbourhoods may in fact be highly walkable because of their cultural attractiveness yet perform poorly on objective metrics (e.g., sky openness, degree of greenery, and percentage of sidewalks). Therefore, perceived walkability should be considered alongside the objective context when assessing overall walkability.
Perceived walkability is currently assessed primarily through questionnaires, and relatively authoritative questionnaire scales have been developed in this area, such as NEWS and its derivatives [4,13,14], LWI [10], and PANES [3]. These scales have been widely recognized and used. Streetscape imagery has also played a role in this area. The Place Pulse dataset, released by MIT Media Lab, is a pairwise comparison dataset collected through web-based surveys. Version 1.0 [15] contains three subjective dimensions, and version 2.0 [5] contains six. However, Place Pulse is positioned as a dataset about urban perception and does not fully reflect willingness to walk. A recent study [9] followed an approach similar to that used to construct Place Pulse to publish a street view image dataset on walking preferences in Jeonju City, South Korea, and developed a deep learning model to assess perceived walkability. However, the generalization ability of this model has yet to be validated.
To summarize, the limitations of previous studies are apparent. Regarding physical walkability, the accuracy of semantic segmentation models needs further improvement, and walkability assessment cannot rely solely on semantic segmentation techniques. Regarding perceived walkability, questionnaire surveys are difficult to apply widely because of geographical limitations and high time and monetary costs. Building pairwise comparison datasets through web research and training deep learning models on them is costly and cannot reveal the detailed factors affecting perception. Combining semantic segmentation and object detection models is one possible solution, but the multimodal contrastive learning model CLIP appears to offer a more flexible and efficient one. Furthermore, its zero-shot learning capability dramatically reduces training cost, making rapid deployment possible in a variety of urban scenarios.

The Potential for Perceived Walkability Assessment
Traditional walkability assessment methods, such as semantic segmentation and questionnaires, are hindered by issues of cost, efficiency, and adaptability in complex environments. In the domain of deep learning, while models like ViLBERT and LXMERT [2] attempt to integrate vision and language understanding, it is the CLIP model [12] that stands out due to its superior zero-shot learning capability. The power of the CLIP model lies in its contrastive learning approach, which capitalizes on existing knowledge to decipher complex urban dynamics without extensive labelled data. Having been pre-trained on numerous image-text pairs, CLIP captures rich semantic relationships between images and text. This makes it possible to evaluate urban scenes using natural language cues like "wide sidewalks" or "good walking facilities" to more accurately reflect real-world settings.
The versatility of the CLIP model and its minimal data preparation needs set it apart from traditional methods. Nevertheless, challenges persist. These include handling systematic tasks such as counting and distance calculations, differentiating object types, and providing precise semantic similarity values for text-image pairings. Moreover, the model's sensitivity to phrasing means that iterative "just-in-time" optimization is essential for optimal performance. Despite its challenges, the CLIP model's adaptability and versatility ensure its relevance in the face of rapidly evolving urban environments.

Architecture and Training Approach of CLIP
The CLIP model consists of a visual encoder, often a Vision Transformer or ResNet, and a text encoder based on the Transformer architecture. During pre-training, these encoders align images and texts in a shared embedding space by maximizing the similarity of matched image-text pairs while minimizing that of mismatched pairs. This training allows CLIP to relate visual and textual data directly. At inference, the model compares each candidate text label with the image and assesses which fits best. CLIP's design facilitates zero-shot learning, enabling it to handle new images without labelled data, which reduces the need for vast datasets and increases its versatility across tasks.
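The inference step described above can be illustrated with a minimal sketch. It assumes the image and text embeddings have already been produced by the two encoders; the toy three-dimensional vectors, the function names, and the logit scale of 100 are illustrative assumptions, not values taken from this study.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def zero_shot_probabilities(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts.

    ASSUMPTION: logit_scale=100.0 is an illustrative temperature, not a value
    reported in the paper.
    """
    img = l2_normalize(image_emb)
    logits = []
    for emb in text_embs:
        txt = l2_normalize(emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings standing in for real encoder outputs (hypothetical values).
labels = ["wide sidewalks", "heavy traffic"]
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]

probs = zero_shot_probabilities(image_embedding, text_embeddings)
best_label = labels[probs.index(max(probs))]
```

Because the probabilities come from a softmax over all candidate labels, each value is relative to the other labels in the set, which is one reason the wording and composition of the label set matter.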

Perceived Walkability Assessment
Ewing's theory [6] suggests that willingness to walk is based on objective environmental factors. According to this theory, physical features can influence individual reactions directly, or indirectly through urban design qualities, ultimately determining overall walkability. This provides a theoretical basis for developing, in this study, a perceptual scale based on detailed features of the physical environment. Figure 1 depicts the computational process for assessing perceived walkability scores. Constructing an assessment scale based on objective environmental factors is a crucial step in the process. A larger probability value obtained for an indicator means that the image better matches that indicator. To avoid the result depending on a single indicator, the entropy of the positive and negative indicators is additionally introduced as an adjustment strategy, calculated using equations (1) and (2). In equation (1), $H(X)$ is the entropy of the indicators and $P(x_i)$ is the probability that the image corresponds to the i-th indicator. In equation (2), $d_i$ is the direction of the i-th indicator (+1 for a positive direction and -1 for a negative direction) and $P(x_i)$ is again the probability that the image corresponds to the i-th indicator. $H(X^{+})$ is the entropy of the positive indicators and $H(X^{-})$ is the entropy of the negative indicators. The purpose of this adjustment strategy is to give an additional reward to images that satisfy multiple positive indicators simultaneously and to impose a corresponding penalty on images that satisfy multiple negative indicators. It ensures that an image scores higher not merely because it matches one indicator well but because it matches multiple positive indicators.
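As a rough illustration of the adjustment strategy, the sketch below uses the standard Shannon entropy and combines it with a direction-weighted probability sum. The additive combination, the function names, and the example probabilities are assumptions made for illustration; the exact form of the study's equation (2) may differ.

```python
import math

def entropy(probs):
    """Shannon entropy -sum(p * ln p), skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def perceived_score(probs, directions):
    """Direction-weighted probability sum, adjusted by positive/negative entropy.

    ASSUMPTION: adding the positive-indicator entropy and subtracting the
    negative-indicator entropy is an illustrative combination, not the
    paper's confirmed formula.
    """
    base = sum(d * p for p, d in zip(probs, directions))
    h_pos = entropy([p for p, d in zip(probs, directions) if d > 0])
    h_neg = entropy([p for p, d in zip(probs, directions) if d < 0])
    return base + h_pos - h_neg

# Two positive and two negative indicators (hypothetical scale).
directions = [+1, +1, -1, -1]

# Same total positive probability, spread over two indicators or one.
spread_positive = perceived_score([0.45, 0.45, 0.05, 0.05], directions)
single_positive = perceived_score([0.90, 0.00, 0.05, 0.05], directions)
```

In this sketch the image whose probability mass is spread across two positive indicators receives the higher score, which mirrors the stated intent of rewarding images that match several positive indicators at once.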

CASE STUDY
Mapillary is an open-source street view image platform that provides image metadata [1], allowing users to filter images by timestamp. Experimental results of the method are shown for the Centrum district of Amsterdam. Street view images were collected at 30-meter intervals along the road network, giving a total of 5,669 images. To ensure consistency in the urban landscape, the images were limited to those taken from April through October of each year and within the last five years. Figure 2 shows some examples of the perceived walkability score. The Top 3 Labels with Probability are the three labels in the customized rating scale (Appendix 1) that the model considers to best describe the image.
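The timestamp filter described above might look like the following sketch. The record layout, the reference date, and the function name are hypothetical; real metadata would first need to be converted into `datetime` objects.

```python
from datetime import datetime, timezone

def keep_image(captured_at, reference, max_age_years=5):
    """True if the capture falls in April-October and within the age limit."""
    in_season = 4 <= captured_at.month <= 10
    age_days = (reference - captured_at).days
    # Five years is approximated as 5 * 365 days for simplicity.
    recent = 0 <= age_days <= max_age_years * 365
    return in_season and recent

# Hypothetical reference date and metadata records.
reference = datetime(2023, 11, 1, tzinfo=timezone.utc)
records = [
    {"id": "a", "captured_at": datetime(2022, 6, 15, tzinfo=timezone.utc)},
    {"id": "b", "captured_at": datetime(2022, 12, 24, tzinfo=timezone.utc)},
    {"id": "c", "captured_at": datetime(2016, 7, 1, tzinfo=timezone.utc)},
]

kept_ids = [r["id"] for r in records if keep_image(r["captured_at"], reference)]
```

Here only record "a" passes: "b" was captured in December and "c" is older than the five-year window.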
To show more clearly the sensitivity of the present method to perceived factors, this study additionally calculated the physical walkability of the study area using a traditional semantic segmentation method (the DeepLab V3 model). The four main factors (visual crowdedness, greenery, sky openness, and sidewalk ratio) and their calculation formulas are given in Appendix 2. The weights of the four indicators were assigned using hierarchical analysis (Appendix 3). The calculated physical and perceived walkability scores were visualized as heat maps (Figure 3). As shown in Figure 3(a), the "ARTIS" zoo in southeast Amsterdam is an important physical walkability hotspot due to its vast open green spaces. Other hotspots are concentrated on major urban arteries and intersections, such as Amsterdam Central Station, while the remaining areas have relatively similar walkability scores. In contrast, Figure 3(b) shows a more complex distribution. In addition to the ARTIS zoo, the city centre business district and the famous "red light district" are also perceived walkability hotspots. This suggests that the methodology captures micro-scale urban features that influence perception, such as commercial activities and landmarks, which are difficult to detect in a purely physical assessment.
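The general idea of a segmentation-based physical score can be sketched as pixel shares per class combined in a weighted sum. The class-to-factor mapping, the equal weights, and the negative direction for visual crowdedness below are all assumptions for illustration; the study's actual formulas and weights are those in Appendices 2 and 3, which are not reproduced here.

```python
from collections import Counter

def class_shares(mask):
    """Fraction of pixels per class label in a 2-D segmentation mask."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def weighted_score(factors, weights, directions):
    """Weighted sum of factor values, each signed by its direction (+1 / -1)."""
    return sum(weights[k] * directions[k] * factors[k] for k in weights)

# Tiny 4x4 mask standing in for a segmented street view image (hypothetical).
mask = [
    ["sky", "sky", "tree", "tree"],
    ["building", "tree", "tree", "car"],
    ["sidewalk", "road", "road", "car"],
    ["sidewalk", "road", "road", "road"],
]
shares = class_shares(mask)

# ASSUMPTION: each factor is approximated by a single class share.
factors = {
    "greenery": shares.get("tree", 0.0),
    "sky_openness": shares.get("sky", 0.0),
    "sidewalk_ratio": shares.get("sidewalk", 0.0),
    "visual_crowdedness": shares.get("car", 0.0),
}
# ASSUMPTION: equal weights; crowdedness treated as a negative factor.
weights = {name: 0.25 for name in factors}
directions = {"greenery": 1, "sky_openness": 1,
              "sidewalk_ratio": 1, "visual_crowdedness": -1}

score = weighted_score(factors, weights, directions)
```

A score of this kind depends only on what fraction of the image each class occupies, which helps explain why it can miss perception-related cues such as shopfronts or landmarks.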

DISCUSSION AND CONCLUSION
The above results show that the perceived walkability approach proposed in this study has significant advantages in the following two aspects: 1. It can provide a more comprehensive assessment of walkability because it can identify more details of streetscapes, especially for streets that cannot be measured uniformly using objective metrics. This perceptual perspective-based approach to walkability assessment can [...] Meanwhile, the deterministic nature of the CLIP model ensures its robustness and reproducibility: it consistently produces the same results for the same input. This approach also avoids the time costs associated with traditional questionnaires, making the evaluation process more efficient.
Although this method demonstrates its potential, some aspects still need to be improved. First, the accuracy of the computational results is limited by the performance of the CLIP model, and a reasonable method needs to be developed to assess the performance of the model. In addition, there is still room to optimize the evaluation scale for perceived walkability. Beyond determining the direction of each indicator, its weight allocation can be refined to make the assessment results more targeted. Finally, considering the diversity of perception, specialized assessment scales can be designed for different populations and cultural backgrounds in the future. Benefiting from the training strategy of contrastive learning, the CLIP model is equipped with zero-shot learning capability and thus demonstrates excellent generalization ability, allowing it to adapt to different urban scenarios. Considering its core features of efficiency and low cost, this lightweight assessment method can be considered for future integration into websites or applications. This would not only provide a real-time walkability assessment tool, but also open up new research directions in urban planning, transportation engineering, and other related fields.
It is worth emphasizing that the core idea of this method can be widely applied to a variety of perception studies based on the generation of objective factors as long as reasonable and scientific assessment criteria are developed.

Figure 1 Flow Chart for Calculating the Perceived Walkability Score

Figure 2 Examples of Perceived Walkability Calculation

Figure 3 Heatmap of Physical (a) and Perceived (b) Walkability Results