Spatial Clustering Regression of Count Value Data via Bayesian Mixture of Finite Mixtures

Investigating relationships between response variables and covariates in areas such as environmental science, geoscience, and public health is an important endeavor. Based on a Bayesian mixture of finite mixtures model, we present a novel spatially clustered coefficients regression model for count value data. The proposed method detects the spatial homogeneity of the Poisson regression coefficients. A Markov random field constrained mixture of finite mixtures prior provides a regularized estimator of the number of clusters of regression coefficients with geographical neighborhood information. As a by-product, we also provide the theoretical properties of our proposed method when the Markov random field is exchangeable. An efficient Markov chain Monte Carlo algorithm is developed by using the multivariate log gamma distribution as a base distribution. Simulation studies are carried out to examine the empirical performance of the proposed method. Additionally, we analyze Georgia's premature death data as an illustration of the effectiveness of our approach. The supplementary materials are provided on GitHub at https://github.com/pengzhaostat/MLG_MFM.


INTRODUCTION
Spatial regression models have been widely used in various fields, such as environmental science [16,38,39], biological science [41], and econometrics [5,40], to explore the relationship between a response variable and a set of predictors over a region.One of the key objectives of spatial regression models is to capture the spatial dependence structure of the response variable.Spatial random effects are typically incorporated through intercepts, while the regression coefficients are assumed to be constant over space in both linear models [7] and generalized linear models [11].To capture spatially varying patterns in regression coefficients, [5] proposed geographically weighted regression (GWR).This idea has been further extended in subsequent works by [17,27,37].Additionally, [14] incorporated spatial Gaussian processes into linear regressions to construct a spatially varying coefficients regression model.However, these approaches assume that each location has its own set of regression parameters, which can sometimes lead to overfitting.The detection of clustered covariate effects has significant benefits in various fields, including environmental science, spatial econometrics, and disease mapping.For example, different regions of a country may exhibit distinct economic conditions and development patterns.From a modeling perspective, grouping more advanced regions and less developed regions into separate clusters can lead to a more parsimonious model.

Related Work and Challenges
Spatial cluster detection methods, such as the scan statistic-based method [19,20], provide a solution for detecting spatial heterogeneity.Another important approach for spatial heterogeneity detection is to utilize the Bayesian framework to identify spatial clusters [6,25].These two approaches primarily focus on estimating cluster configurations of spatial responses.Recently, methods for detecting clusters of spatial regression coefficients have been proposed to assess the homogeneity of covariate effects among sub-areas [22,23] using spatial scan statistics.From a graph theory perspective, [24] incorporated spatial neighborhood information based on minimum spanning trees into a penalized approach to identify spatially clustered coefficients.However, the existing literature mainly focuses on Gaussian data within the linear model framework.In many social and environmental applications, Poisson regression for count responses plays a crucial role [4].
Developing clustering algorithms for regression coefficients under Poisson models presents several significant challenges.First, specific spatial contiguity constraints need to be imposed on the clustering configuration to facilitate interpretations in the spatially clustered coefficients regression.Additionally, in many regional science applications, spatially contiguous constraints should not dominate the overall clustering configuration.In other words, the clustering results should include both spatially contiguous and spatially disconnected patterns.While the aforementioned methods [22][23][24] guarantee spatial contiguity, they fail to obtain globally discontiguous clusters that allow two clusters with long geographical distances to be considered part of the same cluster.Furthermore, [1] discusses Poisson regression with spatially clustered intercepts and slopes but does not impose a spatial contiguity constraint.
Another crucial consideration in clustering algorithms is the estimation of the number of clusters.Bayesian inference provides a probabilistic framework for simultaneous inference of the number of clusters and the clustering configurations.Nonparametric Bayesian approaches, such as the Dirichlet process mixture (DPM) model [12], offer choices to estimate the number of clusters and the clustering configurations simultaneously.However, the Dirichlet process mixture model suffers from inconsistency issues [28], which leads to challenges in obtaining a consistent estimator of the number of clusters.To address the problem of over-clustering in the DPM, several approaches in the literature [26,29,36] propose different ideas to obtain consistent estimators of the number of clusters.However, these existing methods do not utilize spatial information, such as neighborhood relationships, despite the potential to enhance clustering performance.According to Tobler's first law of geography [33], "Everything is related to everything else, but near things are more related than distant things." Considering a similar pattern in the data due to similar environmental circumstances is reasonable.Although it may be challenging to incorporate arbitrary types of spatial information with consistent guarantees on the number of clusters, exploring specific types of dependency structures without compromising consistency is still of interest.

Contributions
To address these challenges, we propose a novel approach called the Markov random field (MRF) constrained Mixture of Finite Mixtures (MFM) model (MRF-MFM) to capture the spatial homogeneity in regression coefficients for Poisson models.Our Bayesian method for spatially clustered coefficients Poisson regression incorporates geographical information using the MRF constrained MFM model.This enables us to capture both locally spatially contiguous and globally discontiguous clusters simultaneously.We develop a Gibbs sampler that facilitates efficient full Bayesian inference on the number of clusters, mixture probabilities, and other modeling parameters for Poisson regression, leveraging the multivariate log gamma (MLG) process [4].Through simulations and an analysis of premature deaths data in the state of Georgia, we demonstrate the excellent numerical performance of our proposed mixture models.Additionally, we establish a consistency result for the posterior estimates of the cluster number and associated modeling parameters when the Markov random field is assumed to be exchangeable.
Our proposed method has several unique aspects.First, the introduction of the Markov random field into the mixture of finite mixtures model for spatial cluster coefficients regression is a novel idea.This approach has wide applicability in spatial statistics applications, such as environmental science and geographical analysis, providing a valuable alternative to existing literature that primarily relies on penalized regression or scan statistics.We provide a detailed comparison with related literature (e.g., [1,24]) in Section D of the supplement.Second, by adopting a full Bayesian framework, our clustering results offer useful probabilistic interpretations.Moreover, our developed posterior sampling scheme ensures efficient computation.Third, our theoretical result is among the first of its kind for mixture models under the exchangeable assumption.The posterior consistency result not only justifies the excellent empirical performance (e.g., regularization on the number of clusters) but also connects with existing theoretical findings on mixture models in general.

METHODOLOGY 2.1 Clustered Poisson Regression and Mixture of Finite Mixtures
Consider a Poisson regression model with spatially varying coefficients as follows where  (  ) is a × covariates matrix,   ∈ {1, • • • ,  } are labels of clusters,    =  (  ) is a  dimensional regression coefficients at location   .From [14], a Gaussian process prior can be assigned on regression coefficients to obtain spatially varying pattern.Compared with spatially varying pattern, heterogeneity pattern of covariate effects over subareas is also universally discussed in many different fields, such as real estate applications, spatial econometrics, and environmental science.
In the popular Chinese restaurant process,   ,  = 2, . . .,  are defined through the following conditional distribution [12]: where | | is the size of cluster , and  is a concentration parameter of Dirichlet Process.While CRP has a very attractive feature of simultaneous estimation on the number of clusters and the cluster configuration, a striking consequence of this has been recently discovered [28] where it is shown that the CRP produces extraneous clusters in the posterior, leading to inconsistent estimation of the number of clusters even when the sample size grows to infinity.A modification of the CRP called Mixture of finite mixtures (MFM) model is proposed to circumvent this issue [29]: where  (•) is a proper probability mass function on {1, 2, . . ., },  is a concentration parameter of Dirichlet Process and  ℎ is a point-mass at ℎ. Compared to the CRP, the introduction of new tables is slowed down by the factor   ( + 1)/  (), which allows a model-based pruning of the tiny extraneous clusters.
where  is the number of existing clusters.

Introducing Dependency on the Base Measure
Recall that the full model for MFM is where  is the base distribution for .The main insight of MFM is introducing a prior on the length of the Dirichlet distribution, and thus renders some regularization on the number of clusters created.However, the fourth step in the model, where i.i.d.samples are obtained from a base measure, fails to incorporate any dependency structure.
Inspired by [31], we apply the pairwise MRF in the level of coefficients to bring in interactions.With the assistance of Markov random field modeling, our MRF-MFM can incorporate more broad types of base measures.Consider an undirected random graph  = ( , , ), where  = { 1 , . . .,   } is the vertex set while  is the set of graph edges, with weights  on the corresponding edges.Each vertex   is associated with a random variable   for  = 1, 2, . . ., .The pairwise MRF model is defined as where   is the normalizing constant.For example, for a Gaussian MRF,   (  ) = − where C 2 := { ∈ C | . .: | | = 2} and C is the set of all cliques for the random graph ( , , ).For the spatial clustered coefficient regression, we study the component  defined in equation ( 8) with a MFM prior.Our next theorem provides the generalized urnmodel induced by MRF-MFM, thus a collapsed Gibbs sampler can be applied.
Theorem 2.1.Suppose the data generating process follows equation (6) with  replaced by the Markov random field Π( 1 , ...,   ) in equation (8).If  is a continuous distribution and  0 > 1, the distributions of   0 given  1 , . . .,   0 −1 is proportional to with (); where  * 1 , . . .,  *  ,  ≤  0 −1 are the distinct values taken by  1 , . . .,   0 −1 and This theorem shows how the MRF constraints directly affect the urn sampling scheme compared with MFM.Considering the pairwise interactions, we model the conditional cost functions as where  is the smoothness parameter, () denotes the set of the neighbors of observation .The spatial smoothness can be controlled by the magnitude of .When  = 0, the MRF-MFM reduces to MFM [29].The conditional cost function in (10) is used in the data analysis of the paper.

Spatial Clustered Coefficient Regression for Count Value Data
In the MRF-MFM, the multivariate normal distribution is a natural choice for the base distribution of  1 , • • • ,   .However, since the multivariate normal distribution is not a conjugate prior for Poisson regression, if it is used as the base distribution, it must be updated with Metropolis-Hastings or auxiliary parameters, such as those proposed by [30], in Gibbs sampling algorithms.Additionally, the multivariate normal distribution has a thin tail, which is not suitable for estimating long-tail probabilities, such as the Poisson distribution.To address these limitations, [4] introduced the multivariate log-gamma distribution (MLG), which is conjugate with the Poisson distribution.The MLG distribution was derived from the multivariate gamma distribution formulated by [10] and transformed to the log-scale.However, this transformation complicates Gibbs sampling, as it requires component-wise updating to obtain known full-conditional distributions.Instead, the approach proposed by [4] allows for block-wise full-conditional distributions that are easier to simulate from.Furthermore, the MLG distribution exhibits asymptotic properties with respect to the multivariate normal distribution.A brief review of the MLG distribution is provided in the next section, and interested readers can refer to [4] for more details.In [16], it is demonstrated that the MLG distribution has better long-tail probability properties compared to the multivariate normal distribution.In other words, the MLG distribution not only serves as a conjugate prior for the Poisson distribution but also has the ability to handle long-tailed probabilities.The ultimate goal is to propose an MRF-MFM for spatially clustered coefficients in Poisson regression based on the MLG prior.

Probability Density Function for Multivariate Log-Gamma Distribution
We first review the multivariate log-gamma distribution from [4].
We define the -dimensional random vector  = ( 1 , ...,   ) ′ , which consists of  mutually independent log-gamma random variables with shape and scale parameters organized into the -dimensional vectors  ≡ ( 1 , ...,   ) ′ , and  ≡ ( 1 , ...,   ) ′ , respectively.Then define the -dimensional random vector  as follows where the matrix  ∈ R  × R  and  ∈ R  .[4] called  the multivariate log-gamma random vector.The random vector  has the following probability density function: where "det" represents the determinant function.As a shorthand we use the notation, MLG (, ,  , ), for the probability density function in (12).

Conditional Distributions for Multivariate Log-Gamma Random Vectors
Gibbs sampling from full-conditional distributions requires simulating from the conditional distributions of multivariate log-gamma random vectors.In this section, we provide a review of the technical results necessary for simulating from these conditional distributions.
We first look at Proposition 1 from [4].Let  ∼ MLG (, ,  , ), and let  = ( ′ 1 ,  ′ 2 ) ′ , where  1 is -dimensional and  2 is ( − )dimensional.In a similar manner, partition  −1 = [ ] into an  ×  matrix H and an  × ( − ) matrix B. Then, the conditional pdf of  1 is given by where  1.2 ≡ exp( −  −1  − ()) and the normalizeing constant M is so the cMLG(,  ,  1.2 ) is equal to the pdf in equation (17), where "cMLG" stands for "conditional multivariate log-gamma." In [4], it indicates that cMLG does not fall within the same class of pdfs given in (13).This is primarily due to the fact that the real-valued matrix H, within the expression of cMLG, is not square.Thus, we require an additional result that allows us to simulate from cMLG.
Next, we look at the Theorem 2 from [4].Let  ∼ MLG (0  , ,  , ), and partition this -dimensional random vector so that  = ( ′ 1 ,  ′ 2 ) ′ , where  1 is -dimensional and  2 is ( − )-dimensional.Additionally, consider the class of MLG random vectors that satisfy the following: (15) where in general 0 , is an  ×  matrix of zeros;  − is an ( − ) × ( − ) identity matrix; is the QR decomposition of the  ×  matrix H; the ;  1 is a  ×  upper triangular matrix; and  2 > 0. Hence, the marginal distribution of the g-dimensional random vector  1 is given by where the normalizing constant  1 is And, the -dimensional random vector  1 is equal in distribution to ( ′  ) −1  ′ , where the -dimensional random vector  ∼ MLG(0  ,   ,  , ).
In [4], it is evident that this particular class of marginal distributions (defined in Theorem 2 in [4]) falls into the same class of distributions as the conditional distribution of  1 given  2 .And Theorem 2 in [4] provides a way to simulate from cMLG.Furthermore, it shows that it is (computationally) easy to simulate from cMLG provided that  ≪ .Recall that  is ×, which implies that computing the  ×  matrix ( ′  ) −1 is computationally feasible when g is "small." We refer the readers to [4] for a comprehensive discussion.

FULL CONDITIONAL DISTRIBUTIONS AND ALGORITHM
We adapt the MRF-MFM in conjunction with MLG to a spatial Poisson regression setting, focusing on the clustering of spatiallyvarying coefficients  ( 1 ),  1 where the detailed derivations are given in Section A in the supplement.
where  − denotes the partition obtained by removing   and where, ,  , ) end for

Theoretical Properties under the Exchangeable Structure
In this section, we assume the covariates  (  ) are generated from random homogenous distribution so it is marginalized.The incorporation of proper dependency structures into the estimation process and assessing uncertainty is always an interesting subject.However, complex dependency structures may destroy the consistency of MFM.Therefore, to maintain theoretical consistency, this paper considers the case in which samples from the base measure are a subset of an infinite sequence of exchangeable variables.
In Bayesian Statistics, the infinite sequence of exchangeable random variables is an important concept.When  1 , . . .are infinite exchangeable, for any finite , where  () is the set of all permutations for the index set {1, . . .,  }.
If  1 , . . .are i.i.d.sampled from a distribution  (), then they are exchangeable, but the reverse is not always true.Some widely used models are based on exchangeable random variables that are not independent, like the Pólya's Urn [3] and Gaussian random variables that have the same marginal distribution and the same correlation between any two of them.
The famous de Finetti's Theorem [9] reveals the intrinsic characterization of exchangeable random variables: there is a latent random variable  , such that  1 , . . .,   are a subset of a infinite sequence of exchangeable variables sampled from Π( 1 , . ..).It is summarized into the following sampling procedure: where  only depends on Π( 1 , . ..).In other words, a subset of an infinite sequence of exchangeable variables are conditionally i.i.d.given their latent labels.We refer to [2] for more details on exchangeable sequences.
Theorem 3.1.Suppose the data generating process follows equation (6) with  replaced by the hierarchical distribution in equation (20), and the distribution is correctly specified.If   (1), . . .,   () > 0, denote  as the random variable for the number of clusters and  is all the possible values  will take in the true data generating process.Then we have Theorem 3.1 provide some insight into our proposed MRF-MFM, compared to Dirichlet process mixture model with the above Markov random fields [DP-MRF; 31].For DP-MRF, there could be a lot of small spurious clusters due to inconsistency of the Dirichlet process mixture even in the i.i.d.case [28].Due to the fact that we specify a prior distribution for the number of components, the number of components in the posterior is appropriately regularized.Even though the consistency result only holds for the exchangeable structure, we believe that the regularization effect holds for all types of structures.Theorem 3.1 is an extension of Theorem 5.2 in [29] to the case of an exchangeable base measure.The limitation of the above theorem is that it does not explore the frequentist property of the posterior, where the number of clusters is assumed to be a fixed truth.

SIMULATION 4.1 Settings
Our goal is to sample from the posterior distribution of the unknown parameters ,  = ( 1 , ...,   ) ∈ {1, ...,  } and  = ( 1 , . . .,   ).We choose  − 1 ∼ Poisson(1) and  = 1,  = 0  ,  = 100  and  =  = 100001  for all the simulations and real data analysis, where 0  is an -dimensional vector with 0, 1  is an -dimensional vector of 1's, and   is an -dimensional identity matrix.The computing algorithm and full conditional distributions are presented in Appendix 3, which efficiently cycles through the full conditional distributions of   | − for  = 1, 2, . . .,  and , where  − =  \   .The marginalization over  avoids the need for complicated reversible jump MCMC algorithms or allocation samplers.The posterior sampling algorithm is provided in Algorithm 1 in Appendix 3. Detailed deviations of the full conditional distributions are also outlined in Appendix 3. Using the posterior mean or median of clustering configurations  alone is not appropriate.Dahl's method [8] offers a remedy for posterior inference of clustering configurations based on the squared error loss.Additionally, alternative loss functions that do not rely on squared errors, such as those proposed in [34], can be considered.The Rand Index (RI) [32] is used to measure the accuracy of clustering.The tuning parameter in the Markov random fields requires careful selection in our proposed model.We utilize the Logarithm of the Pseudo-Marginal Likelihood (LPML) [18] for tuning parameter selection, where a model with a larger LPML value is preferred.

Simulation Setting and Evaluation Metrics
Our analysis is based on the spatial structure of the state of Georgia, which contains 159 counties.Using the county-level data, we build the graph using an adjacency matrix among different counties.159 counties represent 159 vertices in this graph, and if a county shares a boundary with another county, then   and   are connected.This graph is used for both simulation studies and real data analysis.We consider two different spatial cluster designs shown in Figure 1.The first design consists of two disjoint parts located in the top and bottom parts of Georgia.A second cluster comprises the counties in the middle.The second design comprises three major spatial clusters.It is designed to mimic a common premature death pattern in which geographically distant areas can share a similar distribution pattern, and geographical proximity is not considered the only factor responsible for homogeneity in premature death rates.
Two different scenarios are considered for each design.The first scenario does not take into account spatial random effects, while in the second scenario, spatial random effects are included for each design.The spatial random effects are assumed to follow a multivariate normal distribution with a mean zero and exponential covariogram.Our simulation study consists of four scenarios in total.The details of the data generation process are given as  The four scenarios are for two cluster design without spatial random effect, two cluster design with spatial random effect, three cluster design without spatial random effect, and three cluster design with spatial random effect, respectively.In the three clusters design, the original regression coefficients are set to be 0.5, 1 and 1.5 for each cluster correspondingly.On the other hand, in two clusters design, the original regression coefficient set to be 1 and 1.5 for each cluster, respectively.For each case, we add the spatial random effect with the intensity.We use the centroid coordinate in each county to represent that county then construct the spatial random effect.Also, the range parameter and spatial variance parameter are both fixed in each simulation.In each case, we avoid the zero count value to prevent numerical instability.Based on the estimated number of clusters and Rand Index (RI), the clustering performance is evaluated.Each replicate is also used to calculate the final number of clusters estimated.A total of 100 sets of data are generated under different scenarios.We run 5000 iterations of the MCMC chain and burn-in the first 1000 for each replicate.

Simulation Results
For each replicated data set, we fit MFM and MRF-MFM with different values of the smoothness parameter and select the best smoothness parameter for each replicate based on LPML.We see that our model outperforms the MFM model in terms of LPML in all four different scenarios.We also evaluate the performance in terms of estimation results of the number of clusters.We report the proportion of times the true cluster recovered among the 100 replicates.For the two-cluster without spatial random effect design, we find out our model can recover the true number of clusters 100% of the replicates.And the MFM model can recover 85% of the replicates.In this case, both models perform well in the number of clusters estimation.But our model outperforms the MFM model in terms of LPML value.For the two-cluster design with spatial random effects, we see that our model can recover the true number of clusters 97% of the replicates, but the MFM model did not recover the true cluster for any replicates.For the three-cluster without spatial random effect design, we find out our model can recover the true number of clusters 88% of the replicates.On the other hand, MFM recovers 62% of the replicates.Finally, for the three-cluster design with spatial random effects, we find out our model can recover the true number of clusters 73% of the replicates.However, MFM did not recover the true cluster for all replicates.
The results of the comparison of LPML, Rand index, and estimation of the number of clusters for each design can be found in Table 2. Our method can effectively estimate the true number of clusters based on the results shown in Table 2.However, if spatial random effects exist, MFM will overestimate the number of clusters.Our proposed method also outperforms vanilla MFM with respect to model fitness and clustering, as demonstrated by the LPML values and Rand index.Furthermore, we show the average mean square error (AMSE) of our proposed method and MFM in Table 3.We see that in all four different scenarios, our proposed method outperforms MFM in terms of coefficients estimations.The improvement of our proposed methods is evident for the data generated from the model with spatial random effect.In this study, the proposed methods are used to analyze the factors that influence the number of premature deaths in Georgia.The objective of this study is to investigate the relationship between premature deaths and environmental factors such as PM 2.5 and food environment index.The dataset is available at www. countyhealthrankings.org with 159 observations corresponding to the 159 counties in state of Georgia in 2015.For each county, the dependent variable is the number of the premature death in each county.The premature death is the death that occurs before the average age of death in a certain population.In the United States, the average age of death is about 75 years.The dependent variable is the number of lives lost per 100,000 population before age 75 in each county.The two covariates we consider in this paper are PM 2.5 ( 1 ) and food environment index ( 2 ).PM 2.5 is the average daily density of fine particulate matter in micrograms per cubic meter.The food environment index is the index of factors that contribute to a healthy food environment, 0 (worst) to 10 (best).Figures 2a and 2b present a visualization of the response and two covariates on the Georgia map.

Data Analysis
In this section, we apply the proposed methodology to present a detailed analysis of premature death data in the state of Georgia.First, we rescale the data to a decent range as the variance in the Poisson distribution is equal to the mean.The count of the premature death is scaled to hundreds.We run 25,000 MCMC iterations and burn-in the first 15,000 iterations.The smoothing parameter is tuned over the grid {0.1, 0.2, . . ., 1}.All other parameters are set to be consistent with the simulation study.The final clustering result corresponds to the largest LPML [18], hence we choose the smoothing parameter equal to 0.3.The 159 counties turned out to be put into four clusters as illustrated in Figure 3.The number of the counties in each cluster are 150, 3, 5 and 1, respectively.We also compare our model with the best LPML to vanilla MFM, Latent Gaussian  Based on the LPML results, our proposed model outperforms other models.In contrast, there are 15 different clusters identified by vanilla MFM.From the estimation results shown in Table 4, we see that all the counties with higher PM 2.5 will have higher premature deaths.For Cobb County, PM 2.5 has the largest effect on premature death.An extensive analysis could be conducted to investigate why the majority of counties are grouped into one cluster while the other three clusters only contain a couple of counties.For instance, one possible approach is to use log likelihood ratio test (LRT) to detect spatial cluster signals.

DISCUSSIONS
There are several topics beyond the scope of this paper that merit further investigation.Firstly, in our MCMC algorithm, a numerical integration is required for Gibbs sampling.Developing an efficient calculation algorithm for numerical integration would expand the applicability of our proposed methods.Additionally, the proposed algorithm encounters numerical instability when zero counts are observed, which should be addressed in future research.Moreover, different clusters may exhibit distinct sparsity patterns in the covariates.Incorporating spatially clustered sparsity structures of regression coefficients into the model would allow for the selection and identification of the most important covariates.The selection of a tuning parameter for the Markov random field is also necessary.Proposing a hierarchical model for the tuning parameter would be an interesting avenue for future work.Furthermore, exploring the frequentist properties of the posterior distribution is also an area that can be investigated in future research.

SUPPLEMENTARY MATERIALS
The supplementary materials, including a detailed comparison with related literature such as [24] and [1], proofs of the main theorems, derivations of full conditional distributions and MCMC algorithms, additional simulations using data from the states of Georgia and Mississippi, and reproducing codes for data analysis, are available on GitHub at https://github.com/pengzhaostat/MLG_MFM.
For each term   in  = (  , . . .,   ), the full conditional distribution is: where and and "det" is a short hand as determinant of a matrix.

B ADDITIONAL COMPARISON FOR SIMULATION (STATE OF GEORGIA)
We present additional comparison for simulation section (State of Georgia).We compare our proposed method to LGP and CAR in two cluster design.In Figure 4, the values above zero indicate that our method has higher LPML than comparator.The results shown that we have a better result for both comparator.

C ADDITIONAL SIMULATION FOR DIFFERENT SPATIAL GRAPH (STATE OF MISSISSIPPI)
We provide another simulation design with different spatial graph.This additional analysis is based on the spatial structure of the state of Mississippi, which contains 82 counties.We consider a different spatial cluster designs shown in Figure 5.This design consists of two disjoint parts located in the top and bottom parts of Mississippi.Two different scenarios are considered.The first scenario does not take into account spatial random effects, while in the second scenario, spatial random effects are included for each design.The spatial random effects are assumed to follow a multivariate normal distribution with a mean zero and exponential covariogram.Based on the estimated number of clusters and Rand Index (RI), the clustering performance is evaluated.Each replicate is also used to calculate the final number of clusters estimated.A total of 50 sets of data are generated under different scenarios.We run 3000 iterations of the MCMC chain and burn-in the first 1000 for each replicate.
The results of the comparison of LPML, Rand index, and estimation of the number of clusters for each design can be found in Table 5.Our proposed method outperforms vanilla MFM with respect to model fitness and clustering, as demonstrated by the LPML values and Rand index.Additional comparison to LGP and CAR also presented.In Figure 6, the values above zero indicate that our method has higher LPML than comparator.The results shown that we have a better result for both comparator.

D COMPARISON WITH THE RELATED LITERATURE
Both [24] and [1] perform clustered coefficients separately for different coefficients, as shown in Figures 2 and 4 in [24], and Figures 1 and 3 in [1].In contrast, our approach performs clustering for all coefficients together, where all coefficients share the same cluster configuration.For example, Figure 3 illustrates our approach generating a single clustered configuration on the county map, despite estimating three coefficients.It is important to correctly specify whether a single partition or different partitions should be used for a model in practice.Here, we consider a single partition for different coefficients for the following reasons: • Methodology: Using a single partition for different coefficients is more interpretable as it automatically produces a cluster assignment of the spatial objects.• Simulation: When the true data generation process is based on a single cluster assignment, a single partition with different coefficients performs better than having separate partitions.This is because the latter leads to more variance in estimations, resulting in overfitting the model.To demonstrate this, we implemented [24]'s approach using the R package genlasso and compared the average mean squared errors (AMSE) for a two-cluster design without spatial random effect.Our estimation error is around 0.085 according to Table 2, while the estimation error of [24]'s approach is 26.48.The reproducible code, "fusedlasso.R," is provided in the supplement.The substantial difference in estimation errors is due to (1) our model being correctly specified with the Poisson likelihood, while genlasso uses a Gaussian likelihood, and (2) our model being correctly specified with the single cluster assignment.• Theoretical details: Single clusters for different coefficients are less common in the literature, possibly because they are mathematically more complex.For example, in penalized regression ( [24]), a simple penalty on the aggregated coefficients is sufficient to induce separate clusters, corresponding to the fused lasso.However, inducing a common cluster assignment among different coefficients requires considering group-wise penalizations, corresponding to a more complex type of regularization.• Application: For our analysis of Georgia's premature death data, the coefficients of interest, such as PM 2.5 and the food environment index, are primarily influenced by environmental factors.Therefore, it makes sense to claim that their effects cluster in homogeneous groups.

E PROOF OF THE THEOREM 2.1
By Bayes' theorem, we have: where   specifies the cluster that   belongs to.With the property in equation ( 26 where  = |C| is the number of clusters while  is the corresponding random variable of  and   () is defined in equation (4).
The proof of this proposition directly follows from [29], since all conclusions only involves on C,  and  , while the i.i.d assumption on  is not used.Lemma F.2. Suppose the data generating process in Proposition F.1, such that the distribution is correctly specified.Given the cluster configuration C, the data  and the number of components  are independent.
Remark F.3.As with MFM, we generalize the same result to exchangeable cases.Since the dependence between  is totally decided by , when  are exchangeable, all the  play the same role in generating .When  are marginalized, the cluster configuration C covers the same information with the number of components  and the latent labels .
Visualizations of PM 2.5 and Food Environment Index

Figure
Figure Top: Illustration of 4 clusters identified by the proposed method for counties.Bottom: Illustration of 15 clusters identified by vanilla MFM for counties.

2 Figure 4 :
Figure 4: Additional Comparison for Two Cluster Simulation (State of Georgia).

Figure 5 :
Figure 5: Simulation design with two cluster assignments.(State of Mississippi)

2 Figure 6 :
Figure 6: Additional Comparison for Two Cluster Simulation (State of Mississippi).

Table 2 :
Simulation Results for Four Scenarios including LPML, Rand Index (RI), and number of true cluster cover rate by MRF-MFM (optimal) model and MFM model.We provide mean and standard deviation for both LPML and RI.

Table 3 :
AMSE for  Estimation under All Scenarios

Table 5 :
Simulation Results including LPML, Rand Index (RI), and number of true cluster cover rate (CR) by MRF-MFM (optimal) model and MFM model.We provide mean and standard deviation for both LPML and RI.