Innovative Predictive Modeling of Property Appraisal: Emphasizing Coastline Proximity as a Key Factor

Assessing real estate property values involves an intricate process, taking into account numerous influencing factors. One such variable that may significantly impact property appraisal is the proximity to the coastline. This research was conducted utilizing a data set sourced from Vision Government Solutions, which contained information for 12,706 properties located in Branford, CT. After data cleaning, 6696 records that contained effective information were retained. We constructed three distinct linear models, each incorporating the property's distance to the coast differently: as a continuous numeric variable, as a binary categorical variable, and in a more intricate manner, as a categorical variable with multiple tiers. Out of these, the third model was the most effective in predicting appraisal prices. An intriguing prospect is the potential for this model to be generalized to other cities exhibiting similar characteristics to Branford. The subsequent report delves into detailed data analyses, the modeling process, and a discussion, outlining the advantages and potential concerns associated with our study.


INTRODUCTION
Real estate appraisal, the process of estimating property prices, is a pivotal element for both buyers and sellers [1]. According to a report by the European Public Real Estate Association [2], real estate transactions contribute nearly 20% to economic activity.
The dynamics of real estate pricing captivate the public due to the intricate interplay of numerous variables, from the physical characteristics of the property to its geographical location. Our study specifically examines the influence of a property's proximity to the coastline on its market price. Properties next to the coastline, with their environmental appeal and the lifestyle they offer, are often viewed as premium assets. Our study explores appraisal prices in Branford, which are distinct from assessment prices; the latter often incorporate local tax laws and statutory requirements.
Drawing on the work of Ghosalkar and Dhage [3], we use a linear model as our foundational approach, with several independent variables contributing to the estimation of the appraisal price. We propose three separate models, each employing a different measure of a property's distance to the coastline: a continuous measure, a binary variable, and a more detailed categorical measure. Guided by the findings of Conroy and Milosch [4], we aim to refine the method of classifying properties based on their proximity to the coastline. We believe that geographical location, particularly proximity to the coastline, can significantly affect the appraisal price even after accounting for the other factors in the model.
Our study illuminates this aspect of real estate valuation, with a special focus on real estate in Branford, a town distinguished by its beautiful coastline and vibrant real estate market. Leveraging linear regression models and considering a wide range of property characteristics, we introduce new ways to categorize properties based on their distance from the coastline. Furthermore, we investigate the potential applicability of our model beyond Branford, while acknowledging its limitations and areas for future exploration.

DATA

Data Collection
Our study involved the collection and preprocessing of data from Vision Government Solutions, Inc., whose website hosts geographical and structural information about houses in Branford, CT [5]. The data was freely accessible online; however, the precise number of unique pages was unknown, compelling us to automate the retrieval of pages 1 to 19,621. Following this retrieval process, we identified and excised 6,915 pages devoid of property listings.
Subsequently, we focused on extracting specific housing information from the remaining pages. Using pattern matching techniques, we compiled data on crucial variables of interest, such as Property ID, Assessment Price, Appraisal Price, Living Area Size, Year Built, Housing Percent Good, Size in Acres, Roof Cover Type, Model, Story, Bedroom, Bathroom, Half Bathroom, Building Count, Type of Air-conditioning/Heat, and Neighborhood Code. This collated information was then stored in our data frame.
To enrich our analysis, we used geocoding services to append latitude and longitude variables to most of the properties in the data set. However, some properties could not be geocoded, and we removed 5,876 rows to mitigate the potential influence of unidentifiable locations on our analysis.
Given the city's distinguishing features, a major highway and a coastline, we hypothesized that proximity to these two features might affect housing prices in nearby locations. To explore this, we manually traced the route of I-95 and the coastline using Google Earth (refer to Figure 1). Using the latitude and longitude of the houses, we calculated the approximate shortest distance to these landmarks and added this data to our data set.
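The shortest-distance calculation described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: it assumes the traced coastline is stored as a list of (latitude, longitude) points, projects everything to a local planar approximation centered on the property, and takes the minimum point-to-segment distance. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def to_local_xy(lat, lon, lat0, lon0):
    """Project (lat, lon) to meters east/north of (lat0, lon0).

    Uses an equirectangular approximation, which is adequate over the
    few kilometers spanned by a single town.
    """
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def point_to_segment(p, a, b):
    """Distance from point p to the segment a-b (planar coordinates in meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def min_distance_to_polyline(lat, lon, polyline):
    """Approximate shortest distance in meters from a property to a traced line."""
    if len(polyline) < 2:
        raise ValueError("polyline needs at least two points")
    # The property is the origin of the local coordinate system.
    pts = [to_local_xy(plat, plon, lat, lon) for plat, plon in polyline]
    return min(point_to_segment((0.0, 0.0), a, b) for a, b in zip(pts, pts[1:]))


# Hypothetical two-point "coastline" and one hypothetical property.
coast = [(41.25, -72.85), (41.25, -72.75)]
print(round(min_distance_to_polyline(41.26, -72.80, coast)))
```

The same function could be applied to a traced highway polyline to obtain the distance to I-95.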
This initial version of the data set equips us to examine the potential impact of location on property appraisal prices, allowing us to factor in the proximity to both the major transportation route and the coastline in Branford.

Data Cleaning
Our initial step of data cleaning involved identifying potential errors and discrepancies within the data. For instance, the representation of the number of stories varied, with some entries using fraction notation (such as 1 3/4 stories) and others using decimal notation (such as 1.75 stories). We standardized these by converting all values to decimal notation and storing them as a numerical variable. Additionally, one entry listing 20 half bathrooms was identified as a probable typographical error, based on the associated floor plan, and was corrected to 2. To align with our research question, we streamlined the data set, retaining only the 6829 rows that pertain to residential housing with a single building. We also removed 2 properties with more than 9 bedrooms and 2 properties whose air conditioning type was recorded as "00" or "3", because this information was unclear. Finally, we removed 118 rows with missing data, a marginal 1.73% of the retained rows, assuming that their absence would not significantly skew subsequent analyses.
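As an illustration of the story-notation cleanup, the following sketch converts both mixed-fraction and decimal strings to a single numeric value. It assumes the raw entries are whitespace-separated strings such as "1 3/4" or "1.75"; the function name is hypothetical.

```python
from fractions import Fraction


def stories_to_decimal(text):
    """Convert a story count such as '1 3/4', '1.75', or '2' to a float."""
    parts = text.split()
    if not parts:
        raise ValueError("empty story value")
    # Fraction parses integers, decimals, and 'a/b' strings exactly,
    # so a mixed fraction is the sum of its whitespace-separated parts.
    return float(sum(Fraction(part) for part in parts))


for raw in ["1 3/4", "1.75", "2", "2 1/2"]:
    print(raw, "->", stories_to_decimal(raw))
```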
Upon further inspection, we found that some levels of certain categorical variables contained very few observations. To address this, we combined categories to ensure enough rows for each level of each categorical variable. For instance, for the air conditioning type variable, only 6 houses were listed as having unit air conditioning, a small number compared with other air conditioning types. Consequently, we merged this category with "partial AC", creating a new level: "partial/unit air conditioning". Additionally, for clarity of interpretation, we redefined the baseline of the air conditioning type variable to represent "no air conditioning". Following these adjustments (refer to Table 1 and Table 2), each level of the variable was sufficiently represented, allowing us to proceed with our analyses using it. The final data set used for the following analysis contains 6696 rows and 22 variables recording each property's price, geographical information, and housing features.
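The level-merging step can be sketched as below. The level labels and all counts except the 6 unit air conditioning houses are hypothetical, chosen only to show the mechanics of merging sparse levels and placing the baseline first.

```python
from collections import Counter

# Hypothetical raw labels; only the count of 6 "Unit" houses comes from the text.
raw_ac = (["None"] * 40 + ["Central"] * 50 + ["Heat Pump"] * 12
          + ["Partial"] * 15 + ["Unit"] * 6)

# Sparse levels are mapped onto a combined level.
MERGE_MAP = {"Partial": "Partial/Unit", "Unit": "Partial/Unit"}


def merge_levels(values, merge_map):
    """Replace each value by its merged level, leaving other levels unchanged."""
    return [merge_map.get(v, v) for v in values]


def order_levels(values, baseline):
    """Return the distinct levels with the baseline (reference) level first."""
    levels = sorted(set(values))
    levels.remove(baseline)
    return [baseline] + levels


merged_ac = merge_levels(raw_ac, MERGE_MAP)
counts = Counter(merged_ac)
levels = order_levels(merged_ac, baseline="None")
print(counts)
print(levels)
```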

DATA EXPLORATION

The Importance of Coastline
Geographical characteristics often play a crucial role in determining the appraised value of a property. For instance, properties within popular neighborhoods typically command higher prices than those situated in other areas [6]. Branford, the focus of our study, presents several geographical features that could influence its property prices: the town is partitioned into 38 distinct neighborhoods, it is bisected by the I-95 highway running from east to west, and it has an attractive coastline to the south. Before constructing an empirical model to estimate appraisal prices, we first investigated which of these three geographical attributes is most relevant to our study and has the most pronounced impact on appraisal prices.
Figure 2 displays the location of the 38 distinct neighborhoods in Branford. Neighborhood appears to be related to the appraisal price. In particular, 17 neighborhood areas, numbered from 600 to 2000, are located in the southern part of Branford and together span the entire coastline. These areas also contain the majority of high-priced properties.
However, the demarcation of neighborhoods is based on geographical location rather than housing prices. When juxtaposed with the property price distribution map (refer to Figure 3), a significant variability within these 17 areas emerges: some properties exhibit substantially higher appraisal prices than others, a variance we cannot account for if we overlook other geographical features in Branford. The properties with higher appraisal prices happen to be those close to the coastline, suggesting that proximity to the coastline confers a premium on some houses within a neighborhood. This, in turn, potentially elevates the overall housing prices in the neighborhood compared with others. Consequently, the influence of the coastline on housing prices is not adequately captured by neighborhoods alone. Therefore, we decided not to incorporate neighborhoods into our final model.
Another notable geographical feature in Branford is the I-95 highway. Figure 4 presents a scatterplot of the log of appraisal price against distance to the highway. The plot reveals a distinctive trend in Branford: there is minimal price variation within 3000 meters of the highway, but a significant surge in appraisal prices becomes evident once the distance exceeds 4000 meters. At first glance, this appears counterintuitive, as it seems to defy the conventional wisdom that people prefer residences closer to major highways for commuting convenience. This pattern is challenging to explain if we consider the distance from the highway alone.
Nevertheless, when we examine the relationship between distance to the coastline and distance to the highway (refer to Figure 5), a "V"-shaped pattern emerges. For properties within 3000 meters of the highway, the two distances are inversely related: as one decreases, the other increases. Conversely, for properties beyond 3000 meters, an increase in distance from the highway corresponds to an increase in distance from the coastline as well. This relationship reflects the geography of Branford, where both the highway and the coastline run in a general east-to-west direction. Some properties are located between the highway and the coastline, and they account for the negative correlation between highway distance and coastline distance on the left side of the "V"; other properties are located north of the highway, and they explain the positive correlation on the right side. In conclusion, highway distance and coastline distance are related, and although the distance to the highway might typically be a powerful variable for explaining variance in housing prices, this may not hold true in Branford because of the modulating effect of the coastline.
Additionally, the coastline itself may clarify certain "anomalous" phenomena in the data. For instance, in Figure 6, the relationship between heating type and appraisal price may initially seem counterintuitive, as some properties without heating have higher prices than others with heating. However, if we plot the properties on a map of Branford, color-coded by heat type (refer to Figure 7), we observe that several houses near the coastline with relatively high appraisal prices are recorded as having no heat. In other words, once the geography and climate of Branford are considered together with the relationship between heat type and appraisal price, the result has a natural explanation: houses next to the coastline are most likely used as summer houses, for which a heating system would be superfluous. Hence, these "no-heat" houses may command higher prices than others.
Among all the distinct geographical features in Branford, the apparent effects of neighborhood and proximity to the highway on appraisal prices may be partially attributable to the distance to the coastline. In addition, the coastline can explain some otherwise puzzling trends in the data. Therefore, we specifically consider the coastline and add each house's distance to the coastline to our analysis and model construction.

Exploratory Data Analysis
First, we examine our response variable, appraisal price, and decide whether a transformation is needed.
Looking at the distribution of appraisal price, we found that it is skewed to the right: a few properties have unusually high prices. Thus, we applied a log transformation to appraisal price to make its distribution closer to normal and to better reveal linear relationships between the response variable and the explanatory variables (refer to Figure 8).
Then, we examined all numeric variables through a correlation matrix. As shown in Figure 9, there is in general no very strong correlation among the candidate explanatory variables.
We proceeded to plot each numeric variable against the log of appraisal price to see whether the variable is related to the price.
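To illustrate why a log transformation helps with right-skewed prices, the sketch below compares the sample skewness of a small set of prices before and after taking logs. The prices are hypothetical values chosen to mimic a few very expensive properties; they are not taken from the Branford data.

```python
import math
import statistics


def skewness(xs):
    """Population skewness: the mean of the cubed standardized values."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


# Hypothetical appraisal prices with a long right tail.
prices = [180_000, 210_000, 250_000, 260_000, 300_000,
          340_000, 420_000, 650_000, 1_200_000, 3_500_000]
log_prices = [math.log(p) for p in prices]

print("skewness of price:     ", round(skewness(prices), 2))
print("skewness of log(price):", round(skewness(log_prices), 2))
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is the motivation given above.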
We first plotted the living area against the log of the appraisal price. This original representation, however, did not yield a clear relationship between the two variables. Therefore, we applied a square root transformation to the living area (refer to Figure 10).
A similar analysis was conducted for other continuous variables, including percent good and size in acres (refer to Figure 11). The scatter plot of percent good against the log of appraisal price shows a weak positive correlation between the two variables. However, size in acres showed no discernible correlation with the log of appraisal price, either in its original form or after a square root transformation. Given the lack of clear correlation, we excluded size in acres from subsequent analyses.
Then we plotted each categorical variable against the log of appraisal price in a boxplot. Starting with the number of bathrooms, there is a generally positive trend: as the number of bathrooms increases, the log of appraisal price also increases. Similar relationships were found for several other variables as well (refer to Figure 12).

Variable Description
After exploring the importance of the coastline and completing some basic exploratory data analysis, we would like to consider using the following variables shown in Table 7 in the model:

MODEL
Housing prices are influenced by a myriad of factors in real life, leading to diverse analytical approaches in various research contexts.
As such, identifying a specific model that perfectly aligns with our investigation of the impact of coastline proximity on housing prices is challenging. Consequently, we have opted to use the linear model, the most commonly employed and universally applicable model, in our study. We have transformed certain variables to make the linear structure appropriate, and we operate under the assumption that each observation is independent of the others and that the errors are normally distributed with a mean of 0 and a constant variance.
The most challenging task of our study may be how to appropriately use the distance to coastline in our model, so in this section, we will consider three different ways of incorporating distance to coastline into the linear model for log appraisal price.

Model 1
Firstly, we constructed a model incorporating a log transformation of the distance to the coast, as the scatter plot of price against distance to the coast indicated a decreasing trend with increasing distance (refer to Figure 13). By treating the distance to the coast as a continuous numerical variable, we aim to explore how effectively this model construction captures the impact of coastal proximity on the appraisal price estimation.
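The model form is:

$$\log(\text{appraisal price}) = \beta_0 + \beta_1 \sqrt{\text{living area}} + \beta_2\,\text{condition} + \beta_3\,\text{bedroom} + \beta_4\,\text{bathroom} + \sum_{j=5}^{13} \beta_j I(\text{style}_j) + \sum_{j=14}^{16} \beta_j I(\text{AC}_j) + \sum_{j=17}^{19} \beta_j I(\text{heat}_j) + \beta_{20} \log(\text{distance to coast}) + \varepsilon$$

Here $\log(\text{distance to coast})$ is the natural log of the minimum distance from the property to the coastline, $\sqrt{\text{living area}}$ is the square root of the living area, $\text{condition}$ is the housing condition (percent good), $\text{bedroom}$ is the number of bedrooms, and $\text{bathroom}$ is the number of bathrooms.
For $\beta_5$ to $\beta_{13}$, we use indicators for the style, since style is a categorical variable. If the property belongs to one of these styles, the corresponding indicator is equal to 1, and all the other style indicators are equal to 0.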

Model 2
The value of residential properties depreciates non-linearly with increasing distance from urban greenspaces [7]. In a similar vein, we hypothesize that a non-linear relationship may also exist between the natural logarithm of property appraisal and the distance to the coastline. That is, as the distance of a property to the coastline increases, the change in the log of its appraisal price might not be constant on average. In our improved model, we aimed to find a better way to include distance to the coast as a factor in our prediction model: a binary variable that classifies each property as coastal or non-coastal. A crucial decision in this process was determining the cutoff distance at which a property would be classified as coastal or non-coastal.
To approach this, we systematically assessed a range of potential cutoff distances and evaluated how well each one explained the variance in property appraisals. We generated a sequence of potential cutoff distances, ranging from 100 to 6000 meters in increments of 50 meters. For each cutoff distance, we assigned properties a binary classification of coastal (1) or non-coastal (0) based on whether their distance to the coast was less than or equal to the cutoff or greater than it. For example, with the cutoff set at 100 meters, we classified all properties whose distance to the coast was less than or equal to 100 meters as "coastal" (1) and those whose distance was greater than 100 meters as "non-coastal" (0). Subsequently, a linear regression model was constructed for each cutoff scenario. The models aimed to predict the natural logarithm of a property's appraised value using the set of predictors that we decided to include.
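The cutoff search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's implementation: each candidate model here regresses the log price on an intercept and the coastal indicator only, whereas the models in the paper also include the other predictors, and each candidate is scored by R-squared. The data are synthetic, with a price step placed at 200 meters on purpose, and the function names are hypothetical.

```python
import statistics


def r_squared_binary(y, indicator):
    """R-squared of an OLS fit of y on an intercept and one 0/1 indicator.

    With a single binary predictor, the fitted value is the group mean.
    """
    groups = {0: [], 1: []}
    for yi, flag in zip(y, indicator):
        groups[flag].append(yi)
    if not groups[0] or not groups[1]:
        return 0.0  # the indicator is constant, so it explains nothing
    grand_mean = statistics.fmean(y)
    sst = sum((yi - grand_mean) ** 2 for yi in y)
    sse = sum((yi - statistics.fmean(g)) ** 2
              for g in groups.values() for yi in g)
    return 1.0 - sse / sst


def best_cutoff(distances, log_prices, cutoffs):
    """Return the cutoff with the highest R-squared and all scores."""
    scores = {}
    for c in cutoffs:
        coastal = [1 if d <= c else 0 for d in distances]
        scores[c] = r_squared_binary(log_prices, coastal)
    return max(scores, key=scores.get), scores


# Synthetic data: a step in log price at 200 m plus small deterministic noise.
distances = list(range(25, 6000, 50))
log_prices = [(13.5 if d <= 200 else 12.5) + 0.01 * ((i * 7) % 5 - 2)
              for i, d in enumerate(distances)]

cutoffs = range(100, 6001, 50)  # 100 m to 6000 m in 50 m steps
best, scores = best_cutoff(distances, log_prices, cutoffs)
print(best, round(scores[best], 4))
```

On real data the full set of predictors would be refit for every cutoff, for example with a least-squares routine from a numerical library.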
The performance of each model was evaluated using the coefficient of determination (R-squared), which measures the proportion of the variance in the dependent variable that is predictable from the independent variables. Figure 14 shows the R-squared values against their corresponding cutoff distances. Through this procedure, we systematically examined a comprehensive set of potential cutoff distances and identified the one that provided the highest R-squared value: 200 meters. Therefore, we decided to use a 200-meter radius as the cutoff point for deciding whether a property is coastal or non-coastal. Specifically, we categorize our properties into two groups: those located within 200 meters of the coastline and those located beyond this distance. We use this new categorical variable in place of the natural logarithm of the distance to the coast from our previous model. We posit that this change will allow us to better understand and model the influence of coastal proximity on property appraisal values. The model form is:

$$\log(\text{appraisal price}) = \beta_0 + \beta_1 \sqrt{\text{living area}} + \beta_2\,\text{condition} + \beta_3\,\text{bedroom} + \beta_4\,\text{bathroom} + \sum_{j=5}^{13} \beta_j I(\text{style}_j) + \sum_{j=14}^{16} \beta_j I(\text{AC}_j) + \sum_{j=17}^{19} \beta_j I(\text{heat}_j) + \beta_{20} I(\text{distance to coast} \le 200) + \varepsilon$$

Model 3

In Model 3, the distance to the coast enters as a categorical variable with multiple distance intervals. The rationale behind this approach is the anticipation that as the distance of a property from the coastline increases, its impact on housing prices will likely diminish gradually. To account for this effect, the length of the intervals for each range is progressively increased.
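In our final model, we employ the natural logarithm of the appraisal price as the dependent variable. The independent variables include the square root of living area size, the number of bedrooms, the housing condition percentage, the number of bathrooms, the housing style, the air conditioning type, the heat type, and the newly established categorical variable representing the distance to the coast. The form of the model is as follows:

$$\log(\text{appraisal price}) = \beta_0 + \beta_1 \sqrt{\text{living area}} + \beta_2\,\text{condition} + \beta_3\,\text{bedroom} + \beta_4\,\text{bathroom} + \sum_{j=5}^{13} \beta_j I(\text{style}_j) + \sum_{j=14}^{16} \beta_j I(\text{AC}_j) + \sum_{j=17}^{19} \beta_j I(\text{heat}_j) + \beta_{20} I(\text{distance to coast} \in (50, 80]) + \dots + \beta_{28} I(\text{distance to coast} \in (2000, 7000]) + \varepsilon$$

By incorporating these refined variables, we aim to gain deeper insights into the relationship between proximity to the coastline and housing prices, leading to a more comprehensive and accurate model.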
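The interval coding used by this model can be sketched as below. Only the edges 50, 80, 120, 200, 2000, and 7000 meters appear in the text; the edges between 200 and 2000 meters are hypothetical placeholders chosen so that the interval widths grow with distance, and treating the nearest interval as the baseline is also an assumption.

```python
from bisect import bisect_left

# Upper edges of right-closed intervals (lo, hi], in meters.
# 400, 700, 1000, and 1500 are hypothetical; the rest appear in the text.
UPPER_EDGES = [50, 80, 120, 200, 400, 700, 1000, 1500, 2000, 7000]


def interval_index(distance_m):
    """Index of the interval containing the distance (0 is the nearest one)."""
    if not 0 <= distance_m <= UPPER_EDGES[-1]:
        raise ValueError("distance outside the coded range")
    # bisect_left places a distance equal to an edge in the interval
    # that ends at that edge, which makes the intervals right-closed.
    return bisect_left(UPPER_EDGES, distance_m)


def distance_dummies(distance_m):
    """Nine 0/1 indicators, one per non-baseline interval."""
    idx = interval_index(distance_m)
    return [1 if idx == k else 0 for k in range(1, len(UPPER_EDGES))]


for d in [30, 65, 100, 3500]:
    print(d, interval_index(d), distance_dummies(d))
```

With ten intervals and one baseline, this yields nine indicators, matching the nine coefficients from β20 to β28 in the model form.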

RESULTS AND DISCUSSIONS
The three full models are displayed in Table 8, and we will briefly interpret and evaluate each model in the following sections.

Model 1
In Model 1, most variable coefficients are significant at the 1 percent level, with the exceptions of the number of bedrooms and the heat pump and partial/unit air conditioning types. This indicates that these variables might not have a significant impact on the appraisal price.
The lack of significance of the number of bedrooms aligns with the findings of Conroy and Milosch [4]. A plausible reason is that home buyers may value a larger overall living space over the number of bedrooms. In some instances, a higher number of bedrooms could imply smaller individual room sizes and a perceived inefficient use of the total property space. This might explain the lack of significance of the number of bedrooms in all of our models.
Following the interpretation of log-log coefficients as elasticities given in Introductory Business Statistics [8], the coefficient of the variable of interest in this model, -0.1197, means that, with the other variables held constant, a 10% increase in distance from the coast is associated with an approximately 1.2% decline in appraisal price.
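The elasticity reading of the coefficient can be checked with a short calculation. The sketch below compares the usual linear approximation (coefficient times percent change) with the exact change implied by a log-log model; the helper name is hypothetical, and only the coefficient -0.1197 comes from the results.

```python
BETA_LOG_DISTANCE = -0.1197  # Model 1 coefficient on log(distance to coast)


def pct_change_in_price(beta, pct_increase_in_distance):
    """Exact percent change in price implied by a log-log coefficient."""
    ratio = 1.0 + pct_increase_in_distance / 100.0
    return (ratio ** beta - 1.0) * 100.0


approx = BETA_LOG_DISTANCE * 10            # elasticity approximation
exact = pct_change_in_price(BETA_LOG_DISTANCE, 10)

print(f"approximate change: {approx:.3f}%")
print(f"exact change:       {exact:.3f}%")
```

The approximation gives a decline of about 1.2%, while the exact figure is slightly smaller, about 1.13%, because the elasticity interpretation is most accurate for small percentage changes.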
Our results also indicate a diminished preference for heating systems in Branford properties. With winter temperatures in Branford typically ranging between 23 and 46 degrees Fahrenheit, heating would generally be considered a necessity. One possible explanation is that home buyers in Branford may primarily intend to use their properties during the warmer seasons of spring and summer, making a heating system less important and its omission a potential opportunity for cost saving in property renovation. This finding aligns with our earlier hypothesis regarding the relationship between heating type and the coastline in the Data Exploration section.
Regarding air conditioning systems, it appears that central air conditioning systems are preferred over heat pump or partial/unit types, at the 0.01 significance level. This conclusion is supported by two pieces of evidence. First, the number of installations of heat pump, partial, and unit air conditioners is much lower than that of central air conditioners. Second, the coefficients for both the heat pump and partial/unit types are negative, whereas the coefficient for the central air conditioner is positive. This suggests that the installation of a central air conditioner could potentially increase the appraisal price.

Model 3

To facilitate a more straightforward presentation of the coastline distance coefficients, we consider a hypothetical property in Branford. This property is designed to approximate the median conditions of the area, featuring a living area of 1800 square feet, 80% good housing condition, 3 bedrooms, 2 bathrooms, central air conditioning, hot water heating, and a ranch style. Table 9 displays the predicted prices from our model at different proximity levels to the coastline.
There is a significant initial drop in the predicted property appraisal price, with the largest change occurring as the distance to the coastline increases from (50, 80] to (80, 120] meters. In this scenario, an approximate 35-meter increase in distance from the coastline corresponds to a decrease of approximately $90,500 in the predicted property appraisal.
Subsequent increases in distance from the coastline lead to slower decreases in property appraisal price, with the last few rows of the table indicating that even large increases in distance are associated with smaller decreases in the predicted property appraisal, usually only a few thousand dollars. This further supports the nonlinear relationship between distance to the coastline and property appraisal price.
The final model achieved an R-squared value of 0.7198, indicating that classifying coastline distance in greater detail is effective.

Diagnostics
To verify our modeling, we present diagnostics for Model 3 to check that the assumptions underlying the linear model are reasonable.
As shown in Figure 15, the residual plot has no obvious pattern, which supports the assumption of constant error variance. In the normal Q-Q plot, the left and right tails deviate from the line. However, given the inherent complexity of the large, real-world data set used in our study, we do not expect the errors to be perfectly normally distributed. Overall, these deviations do not seriously undermine the normality assumption of our model.
In conclusion, the analyses indicate a reasonable degree of compliance with the core assumptions of linearity, independence, normality, and equal variance that underpin a linear regression model.

CONCLUSION
To conclude, this research investigates three distinct models for predicting the natural logarithm of the appraisal price. All of the models incorporate the square root of the living area, number of bedrooms, percent good condition, number of bathrooms, property style, air conditioning type, heating type, and a measure of the property's distance to the coast.
The first model, Model 1, uses a continuous measure of distance to the coastline, providing a rough understanding of how distance to the coastline could be a crucial factor in estimating property prices in Branford. The second model, Model 2, simplifies the proximity to the coast to a binary variable. We devised a data-driven procedure to choose the threshold for the distance-to-coast variable, 200 meters, and stratified all data points into the two resulting levels. While this model slightly underperforms Model 1, it offers a more straightforward understanding of the impact of coastal proximity on the appraisal price of real estate in Branford.
The third model, Model 3, refines the proximity to the coast further, segmenting all properties into several distinct categories. This model proves to be the most effective of the three and demonstrates the significance of a detailed categorization of distance to the coast for the appraisal price of property in Branford.
Our study substantiates the finding of Conroy and Milosch [4]. Nevertheless, an advantage of our study over theirs is that we did not arbitrarily delineate the distance to the coastline when segregating data into proximity categories. Instead, we applied the cutoff point from Model 2 (200 meters) as a threshold to stratify data in Model 3. Distances within 200 meters were divided more finely, whereas distances beyond 200 meters were classified coarsely.
Our method of stratifying the data should better capture the real-world phenomenon that distance to the coastline significantly influences the appraisal price of a property in Branford when the property is close to the coastline. However, when a property is far from the coastline, the distance to the coast may not play as crucial a role. In these instances, other factors, such as living area and number of bathrooms, may take precedence in determining the appraisal price. Further investigation may be needed to confirm this hypothesis and to monitor whether 200 meters remains a proper threshold for Branford, given newly built properties and a fluctuating real estate market.
One limitation of our study is the precision of our sketch of the coastline and highway. Because we traced them manually on Google Earth, the distances from each property to the coastline and highway are only approximations. Greater precision could be achieved with more exact data sources, for example, measured location coordinates of the coastline and highway, and would result in a more accurate reflection of reality.
Additionally, our findings could possibly be generalized to counties or cities that share certain geographical characteristics with Branford, especially a coastline and a highway that run almost parallel to each other without any overlaps or intersections. In regions that do not fulfill these conditions, our model could still examine the influence of geographical location, but it might be more challenging to isolate the effects of the highway and coastline on properties' appraisal prices. Future studies could aim to devise a more systematic and widely applicable method of stratifying properties based on their proximity to the coast, including a finer division below certain thresholds and a coarser categorization beyond them.

Figure 1: The Coastline and Highway in Branford (Yellow Lines)

Figure 2: Different Neighborhoods on the Branford Map

Figure 4: Log of Appraisal Price vs. Distance to Highway

Figure 5: Distance to Coast vs. Distance to Highway

Figure 6: Heat Type vs. Log of Appraisal Price

Figure 7: Heat Type in Real Estate on Branford Map

Figure 10: Log of Appraisal Price vs. Living Area and Square Root of Living Area Size

Figure 12: Categorical Variables vs. Log of Appraisal Price

Figure 14: R-squared of the Model with Different Cutoff Points

Table 1: Types of Air Conditioning

Table 2: Types of Air Conditioning (after combining levels)

Similar procedures were conducted on the heat type and building style variables. The results are shown in Table 3, Table 4, Table 5, and Table 6.

Table 3: Heat Type

Table 4: Heat Type (after combining levels)

Figure 3: Appraisal Price on Branford Map

Table 5: Building Style

Table 6: Building Style (after combining levels)

Table 7: Variables and Descriptions

In the model forms, the coefficients numbered 14 to 16 belong to the indicators of air conditioning types, and those numbered 17, 18, and 19 belong to the indicators of heat type.

Table 9: Appraisal Price for Different Distance to Coastline