Urban Last Mile Delivery Data Mining for Performance Improvement

This paper explores data mining techniques for improving the operational efficiency of urban last mile delivery services. Using a real industry dataset provided by a logistics partner, we examine delivery patterns and identify clusters associated with delays through cluster analysis in the WEKA software suite. From our initial cluster analysis, we found that 33% of late cases occurred in the latitude range (1.277, 1.287], 36% occurred in the longitude range (103.843, 103.855], and Driver 2065 was involved in 13% of late cases. We also introduce a route analysis method implemented with Python, the Pandas and Folium packages, and the Open Source Routing Machine (OSRM) API. Through route analysis, we visualized the historical routes taken by drivers alongside the routes recommended by OSRM for their given jobs. In the case of Driver 2065, this allowed us to identify visits to non-job locations and extended durations spent at high-rise and high-density buildings. To overcome imprecise GPS coordinates for job locations, we propose an approach to location estimation that makes it possible to compute travel time and service time, the latter being the time spent at each job location. The resulting insights can guide operational improvements that help time-sensitive deliveries be completed punctually.


INTRODUCTION
The field of Data Mining has emerged as a vital tool for deriving meaningful insights from vast datasets, facilitating informed decision-making across a spectrum of enterprise and supply chain scenarios. For instance, its applications include optimizing inventory management and fuel consumption, analysing oil price movements, and predicting oil production, among others [1] [2] [3] [4]. Particularly pertinent to delivery-oriented businesses, data mining plays a pivotal role in enhancing job performance, yielding benefits such as heightened customer satisfaction and reduced economic losses arising from job failures. These enhancements are of particular significance in Last Mile Logistics, as this phase of the supply chain often proves to be the least efficient and most cost-intensive [5]. Amid escalating market competition [6], operational efficiency becomes a cornerstone for cost reduction and the preservation of customer satisfaction [7], thereby safeguarding revenue streams.
In this context, our study is driven by the imperative to unearth actionable insights that can systematically enhance delivery performance, focusing on a real-world industry case. Our dataset originates from a third-party logistics service provider (3PL) encompassing their fleet of vehicles and drivers. Our investigation centers on identifying the causes behind delayed delivery jobs, spanning both pick-up and drop-off tasks. The 3PL's diverse vehicle fleet comprises vans, cars, and motorbikes, catering to a range of goods, from letters to medium-sized parcels. Each vehicle embarks from the central depot at predetermined times, traversing designated job locations for pick-ups and drop-offs. An internally developed Vehicle Routing Engine guides the drivers, its routes adapting dynamically based on real-time job completions.
The dataset is bifurcated into two core components: job attempt information and GPS data. The former entails crucial details like job ID, latitude, longitude, timestamp, driver ID, and completion status, while the latter encompasses driver ID, latitude, longitude, and timestamp.
Our approach embraces both cluster analysis and route analysis. For cluster identification, we harness the user-friendly WEKA software, which streamlines dataset filtering and visual distribution representations. For route analysis, we undertake a simulation employing Python, Pandas, Folium packages, and the Open Source Routing Machine (OSRM) API. A salient feature of our approach lies in our innovative route analysis methodology. Given that the GPS data may lack precise job location coordinates, we devised a method to infer these locations, thereby enabling the segmentation of GPS data into vehicle routes between successive job locations. Moreover, this allowed us to estimate travel times between job sites and service times, reflecting the duration spent at each location.

LITERATURE REVIEW

Cluster analysis
Cluster analysis encompasses several techniques [8] [9], notably K-means clustering [10], hierarchical clustering [11], and density-based clustering. Of particular interest in our study is density-based clustering, which defines clusters over contiguous regions with high point densities [12]. In our study, we employ a density-based approach to discern clusters within visualizations of the distributions of datapoints from our dataset, which are generated using the WEKA software.

Route analysis
To gain clearer insights into vehicle travel patterns and facilitate more in-depth analysis, it becomes essential to pinpoint job locations within the GPS data. Notably, Ma et al. [13] introduced anchor points for truck trip chains through a spatial, density-based clustering algorithm on GPS data, successfully identifying points frequently visited by their fleet. However, our context differs significantly, as we focus on the entire paths traversed by individual vehicles following unique travel plans. In contrast, Ma et al.'s study centered on common points frequented by their fleet, rendering their method less applicable to our scenario. Another relevant study by Sharman and Roorda [14] explored hierarchical agglomeration and partitioning clustering methods to ascertain trip destinations, akin to our context of individual vehicle visits to planned destinations. Yet, disparities emerge concerning dataset characteristics. While they inferred trip ends through vehicle engine on-off status or vehicle stationary periods, our dataset lacks engine status and vehicle GPS information. Instead, our data entails driver GPS information post-vehicle parking, as drivers carry out the deliveries. This intricacy compelled us to develop an innovative approach to estimating job locations within the constrained GPS dataset.
Consequently, we devised a novel method to estimate job locations, effectively navigating the limited GPS information by leveraging job attempt data.This innovative approach ensures consistency in identifying job locations across both segments of the dataset.

DESCRIPTION OF DATASET
The dataset comprises two distinct components: 1) JOB_INFO and 2) GPS_DATA. The former, JOB_INFO, encompasses job attempt details, while the latter, GPS_DATA, comprises GPS information collected from drivers at regular intervals. Tables 1 and 2 delineate the attributes characterizing each segment, accompanied by an exemplar entry.

PROPOSED METHOD
We use the CRISP-DM methodology for this study. The details are described in the following subsections.

Preprocessing
The initial dataset, JOB_INFO, was augmented with the following attributes:
• ATTEMPTED_DAY: Intended for exploratory data analysis purposes.
• ATTEMPTED_TIME: Incorporated for exploratory data analysis endeavors.
• IS_LATE: Introduced to distinguish between late and non-late jobs.
The new attributes 'ATTEMPTED_DAY' and 'ATTEMPTED_TIME' are derived from the existing 'ATTEMPTED DATETIME' field. Meanwhile, 'IS_LATE' is established based on the contents of the 'COMPLETION_DESCRIPTION' attribute, taking a Boolean value (True or False) depending on whether the term 'late' is present in the corresponding entry. In addition, entries featuring 'LAT' = 0 or 'LONG' = 0 were removed, as they were deemed negligible anomalies. The configuration of the modified JOB_INFO, along with an illustrative entry, is delineated.
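The preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: it assumes that 'ATTEMPTED DATETIME' is an ISO 8601 string, that 'ATTEMPTED_DAY' holds the weekday name, and that the match on 'late' is case-insensitive. The function name `preprocess_job_info` is hypothetical. Plain Python is used for portability; the same steps map directly onto Pandas column operations.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def preprocess_job_info(rows):
    """Add ATTEMPTED_DAY, ATTEMPTED_TIME and IS_LATE; drop zero-coordinate rows.

    `rows` is a list of dicts keyed by the JOB_INFO attribute names.
    """
    cleaned = []
    for row in rows:
        # Entries with LAT = 0 or LONG = 0 are treated as anomalies and removed.
        if float(row["LAT"]) == 0 or float(row["LONG"]) == 0:
            continue
        # Assumption: the timestamp is stored as an ISO 8601 string.
        attempted = datetime.fromisoformat(row["ATTEMPTED DATETIME"])
        out = dict(row)
        # Assumption: ATTEMPTED_DAY is the weekday name.
        out["ATTEMPTED_DAY"] = WEEKDAYS[attempted.weekday()]
        out["ATTEMPTED_TIME"] = attempted.strftime("%H:%M:%S")
        # IS_LATE is True when the term 'late' appears in the description
        # (case-insensitive matching is an assumption of this sketch).
        description = str(row["COMPLETION_DESCRIPTION"]).lower()
        out["IS_LATE"] = "late" in description
        cleaned.append(out)
    return cleaned
```

A plain substring test would also flag words that merely contain 'late', so a stricter pattern may be preferable depending on the wording used in 'COMPLETION_DESCRIPTION'.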

Cluster analysis
Subsequently, we initiated a cluster analysis using the WEKA software to unveil visualizations illuminating the distribution of features within JOB_INFO. To distinguish between late and non-late jobs, we designated 'IS_LATE' as the class variable. With an aim to mitigate the occurrence of late cases, we delved deeper by focusing on late instances. Our exploration then shifted towards identifying features with conspicuous spikes in their visual patterns, indicating potential clusters of late jobs under specific conditions. Notably, we directed our attention towards the 'LAT', 'LONG', and 'DRIVER_IDENTIFIER' features, probing the interplay among late jobs, individual drivers, and geographical coordinates.

Route analysis
In the preceding section, our focus successfully narrowed down instances of late jobs to specific drivers and geographical regions, as defined by latitude and longitude ranges. To glean deeper insights from the available dataset, we embarked on a route analysis specifically targeting these drivers and regions. This endeavor involved examining the routes traversed by these drivers between job locations within the regions prone to late deliveries and their associated circumstances. The overall process and steps undertaken in our route analysis are outlined in Figure 1.
To execute the route analysis, we designed a driver's travel pattern simulation for a single day. This procedure commenced with the identification of the driver's job locations. Our approach involved filtering the GPS_DATA to extract entries pertinent to the designated driver and day. Subsequently, we filtered the JOB_INFO data by the same driver and day, thereafter sorting all relevant job attempts based on the 'ATTEMPTED DATETIME' feature in ascending order. We created custom 'JobLoc' objects to represent individual job locations. An illustration of a 'JobLoc' object is presented in Figure 2, and the key attributes of this object are detailed in Table 3. Through iteration, we populated each 'JobLoc' object with its respective job attempts, establishing a sequence of these objects, each representing a distinct job location.
With job locations identified, our attention shifted to pinpointing those within the regions where frequent late jobs occurred, a region referred to as the 'late rectangle' for simplicity. Having established the relevant job locations, we proceeded to simulate the driver's travel pattern by iteratively analyzing filtered GPS_DATA. Given the chronological sorting of GPS_DATA entries, an additional sorting step was unnecessary.
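One possible shape for the 'JobLoc' object and its population is sketched below. The attribute names 'lat_long', 'visit_idx', and the 'recorded_*' fields are taken from the text; the `job_attempts` field, the `build_job_locs` helper, and the rule of merging consecutive attempts that share identical coordinates are assumptions made for illustration, since Table 3 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class JobLoc:
    """One distinct job location visited by a driver on a given day."""
    lat_long: Tuple[float, float]
    visit_idx: int                      # position in the day's visiting order
    job_attempts: List[dict] = field(default_factory=list)  # hypothetical field
    # Filled in later while replaying GPS_DATA.
    recorded_entry_time: Optional[Any] = None
    recorded_entry_loc: Optional[Tuple[float, float]] = None
    recorded_entry_idx: Optional[int] = None
    recorded_exit_time: Optional[Any] = None
    recorded_exit_loc: Optional[Tuple[float, float]] = None
    recorded_exit_idx: Optional[int] = None


def build_job_locs(attempts):
    """Group a driver's job attempts for one day into a sequence of JobLocs.

    Assumption: consecutive attempts with identical coordinates belong to
    the same job location.
    """
    # Sort by attempt time, ascending (ISO 8601 strings sort chronologically).
    ordered = sorted(attempts, key=lambda a: a["ATTEMPTED DATETIME"])
    job_locs = []
    for attempt in ordered:
        coords = (attempt["LAT"], attempt["LONG"])
        if job_locs and job_locs[-1].lat_long == coords:
            job_locs[-1].job_attempts.append(attempt)
        else:
            job_locs.append(JobLoc(lat_long=coords,
                                   visit_idx=len(job_locs),
                                   job_attempts=[attempt]))
    return job_locs
```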
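Selecting the job locations inside the 'late rectangle' amounts to a bounds check. The sketch below assumes the rectangle is the intersection of the latitude and longitude ranges reported by the cluster analysis, each open below and closed above; the function names are hypothetical.

```python
# Ranges reported by the cluster analysis, written as (low, high] bounds.
LAT_RANGE = (1.277, 1.287)
LONG_RANGE = (103.843, 103.855)


def in_late_rectangle(lat, lon, lat_range=LAT_RANGE, long_range=LONG_RANGE):
    """Return True if (lat, lon) lies inside the late rectangle."""
    return (lat_range[0] < lat <= lat_range[1]
            and long_range[0] < lon <= long_range[1])


def filter_late_rectangle(lat_longs):
    """Keep only the (lat, lon) pairs that fall inside the late rectangle."""
    return [point for point in lat_longs if in_late_rectangle(*point)]
```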
The determination of the driver's arrival at a job location relied on the Haversine formula, calculating the distance between the driver's present location (extracted from GPS_DATA) and the job location (from the corresponding 'JobLoc' object). When the calculated distance fell within a predefined threshold, the driver was considered to have reached the job location. Consequently, attributes such as 'recorded_entry_time', 'recorded_entry_loc', and 'recorded_entry_idx' were assigned to the respective 'JobLoc' object based on the current GPS_DATA entry. Subsequent iterations marked the driver's departure from the job location when the distance threshold was met again, resulting in the assignment of 'recorded_exit_time', 'recorded_exit_loc', and 'recorded_exit_idx' attributes. By maintaining synchronization between JOB_INFO and GPS_DATA through the 'visit_idx' attribute, we accurately simulated the driver's travel pattern between job locations. The interval between entry and exit times represented the estimated service time for each job, with the circular region around the job location demarcating the service region. This cumulative information allowed us to approximate job locations within GPS_DATA, with each 'JobLoc' object documenting these estimations. The resulting job locations were visually identified on a map, and path connections between them were plotted using the Python Folium package.
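The arrival and departure test can be illustrated with a short sketch. The Haversine function is standard; the 50 m default threshold, the `(timestamp, lat, lon)` row layout, and the reading of "departure" as the first GPS row that falls outside the threshold after entry are assumptions of this sketch, not values taken from the study.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_entry_exit(gps_rows, job_lat, job_lon, threshold_m=50.0, start_idx=0):
    """Scan chronologically sorted GPS rows for entry into and exit from
    the circular service region around a job location.

    Returns (entry_idx, exit_idx); either may be None if not observed.
    """
    entry_idx = exit_idx = None
    for idx in range(start_idx, len(gps_rows)):
        _, lat, lon = gps_rows[idx]
        inside = haversine_m(lat, lon, job_lat, job_lon) <= threshold_m
        if entry_idx is None:
            if inside:
                entry_idx = idx      # first row within the threshold
        elif not inside:
            exit_idx = idx           # first row outside after entry
            break
    return entry_idx, exit_idx
```

The estimated service time is then the difference between the timestamps at `exit_idx` and `entry_idx`, and the scan for the next job location can resume from `exit_idx`.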
For generating a suggested route by OSRM (Open Source Routing Machine) encompassing a driver's job locations, the OSRM API was employed in conjunction with the 'lat_long' attribute of the 'JobLoc' objects.This process entailed retrieving routes between consecutive job locations and linking them to outline the recommended path from the initial job location within the 'late rectangle' to subsequent ones, preserving the order specified by the 'visit_idx' attribute.
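As a rough illustration of this step, the requests for consecutive job locations can be built against OSRM's HTTP route service. The sketch only constructs the request URLs and performs no network call; the public demo server address, the 'driving' profile, and the helper names are assumptions, and the study may have used a different server or options.

```python
def osrm_route_url(start, end,
                   base_url="https://router.project-osrm.org",
                   profile="driving"):
    """Build an OSRM route-service URL between two (lat, lon) points.

    Note that OSRM expects coordinates in longitude,latitude order.
    """
    (lat1, lon1), (lat2, lon2) = start, end
    coords = f"{lon1},{lat1};{lon2},{lat2}"
    return (f"{base_url}/route/v1/{profile}/{coords}"
            "?overview=full&geometries=geojson")


def consecutive_route_urls(lat_longs):
    """One request URL per pair of consecutive job locations, in visit order."""
    return [osrm_route_url(a, b) for a, b in zip(lat_longs, lat_longs[1:])]
```

Each response is a JSON document whose first route carries a GeoJSON geometry and a duration in seconds; concatenating the geometries in 'visit_idx' order yields the suggested path, and summing the durations gives the suggested travel time.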
To extract the actual path taken by the driver during the day, individual paths between specific job locations of interest were determined and aggregated. The connection between two job locations was established by identifying the corresponding rows in GPS_DATA. Each connection was visualized as a line joining the associated coordinates. To enhance visibility of driver positions along the path, markers were virtually placed at regular time intervals. By employing a predetermined interval and computing the time difference between successive GPS_DATA rows, markers were introduced to the plot at intervals roughly equivalent to the specified time interval.
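The marker placement can be sketched as a selection of GPS rows spaced by roughly the chosen interval. The accumulation rule below (place a marker once at least `interval` has elapsed since the previous marker) and the name `marker_indices` are assumptions; the selected rows could then be drawn on the map, for example as circle markers in Folium.

```python
from datetime import datetime, timedelta


def marker_indices(timestamps, interval):
    """Return indices of chronologically sorted GPS rows at which to place
    a marker, spaced by at least `interval` (a timedelta)."""
    if not timestamps:
        return []
    picked = [0]                 # always mark the start of the path
    last = timestamps[0]
    for idx in range(1, len(timestamps)):
        # Place a marker once enough time has passed since the last one.
        if timestamps[idx] - last >= interval:
            picked.append(idx)
            last = timestamps[idx]
    return picked
```

Because GPS rows arrive at discrete times, the spacing between markers is only approximately equal to the interval, which matches the "roughly equivalent" behaviour described above.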
In summary, our comprehensive route analysis facilitated the estimation of job locations, visualization of drivers' paths, and the determination of suggested and actual travel routes, ultimately providing a dynamic view of drivers' activities during their journeys.

Results of cluster analysis
Our initial step involved filtering based on the condition 'IS_LATE' = True, thus eliminating non-late jobs, as our focus lay in uncovering patterns within late jobs. Our intent was to delve into the interplay among late jobs, individual drivers, and geographical data, suspecting that a significant portion of late job occurrences could be attributed to locality or personnel-related issues. In this pursuit, we scrutinized the 'LAT', 'LONG', and 'DRIVER_IDENTIFIER' features within WEKA visualizations to discern conditions giving rise to notable spikes in late cases.
It is worth acknowledging that the precise definition of a spike substantial enough to warrant our attention through visual inspection is not immediately apparent. Intuitively, such a spike might occur when a substantial proportion of cases within the overall dataset cluster within a narrow range for a specific parameter. To translate this intuition into a quantitative selection criterion, one approach involves examining the number of late cases occurring within the specified parameter range, relative to the total number of late cases. This computation is facilitated by WEKA's visualization features, which provide case counts for specific ranges. In general, our selection of parameter ranges corresponding to significant spikes encompasses at least 10% of the total late cases. Simultaneously, comparable ranges that adhere to similar lengths would yield proportions significantly below half of this value, unless they too qualify as spikes.
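The 10% criterion can be expressed as a small computation. In the study the counts were read from WEKA's visualizations; the sketch below reproduces the same ratio in Python, with hypothetical function names and the half-open range convention (low, high] used elsewhere in the paper.

```python
def late_share(values, low, high):
    """Fraction of late cases whose feature value falls in (low, high].

    `values` holds one feature value (e.g. LAT) per late job.
    """
    if not values:
        return 0.0
    inside = sum(1 for v in values if low < v <= high)
    return inside / len(values)


def is_spike(values, low, high, min_share=0.10):
    """Apply the selection criterion: the range must contain at least
    `min_share` of all late cases."""
    return late_share(values, low, high) >= min_share
```

For example, with ten synthetic late-job latitudes of which three lie in (1.277, 1.287], `late_share` returns 0.3 and the range qualifies as a spike.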
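As illustrated in Figure 3, substantial spikes in late cases are evident when 'LAT' falls in the range (1.277, 1.287] (33% of late cases), when 'LONG' falls in the range (103.843, 103.855] (36% of late cases), and when 'DRIVER_IDENTIFIER' is 2065 (13% of late cases).
This approach provides us with quantitatively grounded criteria for identifying notable patterns within late job occurrences, allowing us to effectively focus our investigation on specific regions and driver attributes where significant late cases cluster.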
Our investigation commenced by initially filtering based on the condition 'IS_LATE' = True, thereby excluding non-late jobs, in pursuit of discerning distinct patterns within late job occurrences.
Our focus was on unveiling potential associations involving late jobs, individual drivers, and geographical attributes. As it seemed plausible that a substantial portion of late jobs might be influenced by locality or driver-related factors, we specifically examined the 'LAT', 'LONG', and 'DRIVER_IDENTIFIER' features within WEKA visualizations. Our aim was to unravel circumstances giving rise to pronounced spikes in instances of late deliveries. It is worth noting that determining the precise magnitude of a spike meriting attention through visual inspection proved nontrivial. We sought a quantitative foundation for our selection criterion, which we achieved by evaluating the number of late cases within a specified parameter range as a fraction of the total late cases. This calculation was facilitated by WEKA's inherent visualization capabilities, offering insights into case counts across parameter ranges.
In general, the parameter ranges delineating significant spikes encompassed at least 10% of the total late cases. Conversely, parameter ranges of similar lengths yielded proportions notably below half of this benchmark, unless they also constituted spikes.
This analysis indicated that certain conditions lead to higher likelihoods of late deliveries: jobs occurring within the latitude range (1.277, 1.287] or the longitude range (103.843, 103.855], and jobs attributed to Driver 2065.
In a subsequent refinement, we focused on the latitude range (1.277, 1.287] as guided by the prior insight. The analysis unveiled that the most substantial spikes in late cases occurred within specific sub-ranges:
• 'LAT' =
This analysis showcases a strategy for identifying prominent late regions among drivers in the dataset. A similar approach can be iteratively employed to unveil secondary late regions by consecutively eliminating late orders from higher-frequency late regions, and subsequently subjecting the remaining jobs to clustering analysis. This iterative process aims to capture all significant late regions. By leveraging this methodology, we can conduct route analysis on jobs within these regions, further enhancing the precision and effectiveness of our analysis.

Results of route analysis
Figures 4, 5, 6, and 7 show the routes suggested by OSRM and the actual paths traced from GPS_DATA for Driver 2065, Driver 2169, and Driver 2811 on January 20, 2023.
In addition, travel time along the suggested route can be evaluated from the OSRM API's recommendations. As for the actual path, a straightforward estimate of travel time can be obtained by computing the time difference between the 'recorded_entry_time' and 'recorded_exit_time' attributes associated with the corresponding JobLoc objects. Upon scrutiny of Figure 4 and Figure 5 pertaining to Driver 2065, a conspicuous discrepancy surfaces within the designated late rectangle. A closer examination of the map reveals prominent clusters of red circles, marking instances where the driver spent significant amounts of time at particular locations. These clusters notably appear around prominent CBD landmarks, such as the Wing On Life Building and the UOB Plaza towers, illustrated in Figure 6 and Figure 7 respectively. This insight sheds light on recurrent deviations from the prescribed path, characterized by:
• Visits to non-job locations
• Extended durations spent at high-rise and high-density buildings.
Similarly, a comparable finding emerged for Driver 2169, where prolonged stays at high-rise and high-density buildings feature as a key observation.
Significantly, a parallel observation is unveiled for Driver 2811, reinforcing the recurrent pattern of extended durations spent at such high-rise and high-density locations.These key findings furnish invaluable insights into the drivers' behaviors, revealing their

Discussion
A common theme for the drivers was the visiting of high-rise, high-density job locations. The drivers typically spent much time at these places, judging by the density of red circles in the plots. It could also be the case that vertical travel time (i.e., time for drivers travelling up and down the buildings) is very long in these locations. This long and uncertain vertical travel time for intensive jobs may be a major cause of the high frequency of late jobs there.
The actual total travel time of the drivers from their first job to their last job in the late rectangle is significantly greater than that suggested by OSRM. This suggests that much time was spent travelling by the drivers outside of travel times between job locations. This supports the notion that drivers might be spending much time travelling within buildings, something that OSRM cannot capture. Another common finding was that the drivers often visited non-job locations between jobs in the late rectangle. Naturally, this might cause subsequent delays and late jobs. Interestingly, many of these non-job locations are food establishments. The drivers tend to visit such places in the early afternoon, suggesting that they might be visiting these places to buy their lunch.

CONCLUSION AND FUTURE WORK
This paper presents an innovative approach to estimating job locations from GPS data within a comprehensive data mining framework aimed at enhancing delivery performance for a logistics partner. Notably, the analysis highlights that late jobs are influenced by factors such as high-rise, high-density locations and visits to non-job sites. Addressing these factors holds the potential to reduce late deliveries and enhance overall performance for the logistics partner.
Future work could refine our approach by developing an adaptive distance threshold to address the accuracy impact of overlapping service regions in our method. Mitigating this challenge requires careful threshold calibration for optimal simulation accuracy without underestimating service times. Additional opportunities include creating statistical models for location-dependent service times and non-job events (e.g., lunch breaks) based on historical data. Integrating these models into logistics planning can enhance delivery operation efficiency.

Figure 1: Steps in route analysis

Figure 2: Illustration of estimation of a job location with a JobLoc object.

Table 3: Key Attributes of JobLoc object

• Jobs occurring within latitude range (1.277, 1.287] or longitude range (103.843, 103.855] demonstrate increased lateness.
• Jobs attributed to Driver 2065 are more prone to being late.

Figure 6: Driver 2065 -cluster at Wing On Life Building