When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting

Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have hierarchical relations. Previous works assume rigid consistency over the given hierarchies and do not adapt well to real-world data that deviate from this assumption. Moreover, recent state-of-the-art neural probabilistic methods impose hierarchical relations only on point predictions or on samples of the predictive distribution. This does not ensure that the full forecast distributions are consistent with the hierarchy, and it leads to poorly calibrated forecasts. We close both these gaps and propose PROFHiT, a probabilistic hierarchical forecasting model that jointly models forecast distributions over the entire hierarchy. PROFHiT (1) uses a flexible probabilistic Bayesian approach and (2) introduces a soft distributional consistency regularization that enables end-to-end learning of the entire forecast distribution by leveraging information from the underlying hierarchy. This yields calibrated forecasts as well as adaptation to real-life data with varied hierarchical consistency. PROFHiT provides 41-88% better accuracy and significantly better calibration over a wide range of dataset consistency. Furthermore, PROFHiT adapts to missing data and can provide reliable forecasts even when up to 10% of the input time-series values are missing, whereas other methods' performance severely degrades by over 70%.


INTRODUCTION
Time-series forecasting is an important problem that impacts decision-making in a wide range of applications. In many real-world situations, the time-series have inherent hierarchical relations and structures. Examples include forecasting time-series of employment [27] measured at different geographical scales, and epidemic forecasting [25] at the county, state, and country levels. Given a time-series dataset with underlying hierarchical relations, the goal of hierarchical time-series forecasting is to generate accurate forecasts for all time-series by leveraging the hierarchical relations between them [12]. Previous hierarchical forecasting methods assume that the dataset is strongly consistent: the time-series values strictly satisfy the underlying hierarchical constraints. Therefore, these models usually impose the generated forecasts to be strongly consistent as well, i.e., the forecasts strictly satisfy the hierarchical relations of the dataset. For example, classical two-step methods [13] use a bottom-up or top-down approach where all time-series at a single level of the hierarchy are modeled independently and the values at other levels are derived using the aggregation function governing the hierarchy. In contrast, many real-world applications have weakly consistent datasets, i.e., the data do not follow the strict constraints of the hierarchy. Such datasets have an underlying data generation process that follows a hierarchical set of constraints but may contain some deviations. These deviations can be caused by factors such as measurement or reporting error, asynchrony in data aggregation and revision pipelines, etc., as frequently observed in epidemic forecasting [1]. Most state-of-the-art methods are designed for applications with strongly consistent datasets and impose rigid constraints; they thus may not adapt to such deviations and can provide poor forecasts for applications with weakly consistent datasets.
Moreover, previous methods do not focus on providing calibrated forecasts with precise uncertainty measures. Traditional methods focus on point predictions only. Recent post-processing methods [2,27,30] refine base independent forecast distributions as a post-processing step. While these methods can be easily applied to forecasts from any model, they do not enable the models generating the base forecasts to learn from hierarchical relations between the time-series of the hierarchy. End-to-end neural methods directly leverage hierarchical relations as part of the model architecture [24] or learning algorithm [11]. Due to their comprehensive end-to-end approach, they usually outperform post-processing methods by imposing hierarchical constraints on the mean or fixed quantiles of the forecast distributions. However, these methods do not enforce hierarchical consistency on the full distributions. Therefore, the forecasts may not be well-calibrated [18], i.e., they produce unreliable prediction intervals that may not match observed probabilities from the ground truth [10].
Table 1: Comparison of PROFHiT with state-of-the-art methods.

In this work, we fill this gap of learning well-calibrated and accurate forecasts for both strongly and weakly consistent datasets by leveraging the underlying hierarchical relations. We propose PROFHiT (Probabilistic Robust Forecasting for Hierarchical Time-series), a neural probabilistic hierarchical time-series forecasting method that provides an end-to-end Bayesian approach to model the forecast distributions of all time-series together (see Table 1 for a comparison). Specifically, we introduce a novel Soft Distributional Consistency Regularization (SoftDisCoR) to tackle this challenge. First, SoftDisCoR enables PROFHiT to leverage hierarchical relations over entire forecast distributions to generate calibrated forecast distributions by encouraging the forecast distribution of any parent node to be similar to the aggregation of its children nodes' forecast distributions (Figure 1). Second, since SoftDisCoR is a soft constraint, our model is trained to adapt to datasets with varying hierarchical consistency, which allows the model to trade off consistency for better accuracy and calibration on weakly consistent datasets. Our main contributions are: (1) Accurate and Calibrated Probabilistic Hierarchical Time-Series Forecasting: We propose PROFHiT, a deep probabilistic framework for modeling the distributions of all time-series together using the soft distributional consistency regularization (SoftDisCoR). PROFHiT leverages probabilistic deep-learning models to learn priors of individual time-series and refines the priors of all time-series leveraging the hierarchy to provide accurate and well-calibrated forecasts. (2) Adaptation to Strong and Weak Consistency via Soft Distributional Consistency Regularization: SoftDisCoR imposes soft hierarchical constraints on the full forecast distributions to help adapt the model to varying levels of hierarchical consistency. We build a novel refinement module over base forecast priors and leverage multi-task learning over shared parameters, which enables PROFHiT to perform consistently well across the hierarchy. (3) Evaluation Across Multiple Datasets and with Missing Data: We show that PROFHiT outperforms a wide variety of state-of-the-art baselines on both accuracy and calibration, at all levels of the hierarchy, for both strongly and weakly consistent datasets. We also show that training with SoftDisCoR enables PROFHiT to leverage hierarchical relations to provide reliable predictions that can handle missing data values in the time-series.

RELATED WORK
Probabilistic time-series forecasting. Classical probabilistic time-series forecasting methods include exponential smoothing and ARIMA [13]. They are simple but focus on univariate time-series and model each sequence independently. Recently, deep-learning-based methods have been successfully applied in this area. DeepVAR [26] trains an auto-regressive recurrent network on a large number of related time-series to directly output the mean and variance parameters of the forecast distribution. Other works are inspired by state-space models and explicitly model the transition and emission components with deep-learning modules, such as deep Markov models [17] and deep state space models [19,23]. Recently, EpiFNP [14] has achieved state-of-the-art performance in epidemic forecasting. It learns the stochastic correlations between the input data and training datapoints to model a flexible non-parametric distribution for univariate sequences.
Hierarchical time-series forecasting. Classical works on hierarchical time-series forecasting used a two-step approach [12,13] and focus on point predictions. They first forecast the time-series at only a single level of the hierarchy and then derive the forecasts for other nodes using the hierarchical relations.
Recent methods like MinT and ERM are post-processing steps applied to the set of forecasts at all levels of the hierarchy. MinT [29,30] assumes that the base-level forecasts are uncorrelated and unbiased and solves an optimization problem to minimize the variance of forecast errors of past predictions. The unbiasedness assumption is relaxed in ERM [2]. Corani et al. [6] and [22] use a fully Bayesian bottom-up post-processing approach using base forecasts from the full hierarchy. Another line of work projects the base forecasts of all time-series into a subspace of consistent forecasts. [9] uses an iterative game-theoretic approach of minimizing forecast error and projection error. Taieb et al. [27] use a copula method to refine base forecasts to be distributionally consistent as a post-processing step. Recent neural methods perform end-to-end learning that enables the model to leverage hierarchical relations while forecasting. Rangapuram et al. [24] use a deep-learning-based end-to-end approach to directly train on the projected forecasts. SHARQ [11] is another recent probabilistic deep-learning-based method that uses quantile regression and regularizes for consistency at different quantiles of the forecast distribution. However, unlike our approach, these end-to-end methods do not regularize for forecast consistency over the entire distribution (distributional consistency) but only over fixed quantiles. Most of these methods are also not designed for cases where the hierarchical constraints are not always consistently followed.

PRELIMINARIES

Problem Statement
Consider a dataset D of N time-series over the time horizon 1, 2, . . ., T. Let y_i ∈ R^T be time-series i, and let the hierarchical relations ℋ specify, for each non-leaf node p, the set of children C(p) whose values aggregate to the value of p.

Definition 1 (Consistency Error - CE). Given a dataset D of N time-series over the time horizon 1, 2, . . ., T and aggregation relations ℋ as above, the dataset consistency error (CE) is defined as
CE(D) = (1/T) Σ_{t=1}^{T} Σ_p | y_p^{(t)} − Σ_{c ∈ C(p)} y_c^{(t)} |,
where the outer sum ranges over all non-leaf nodes p. (Intuitively, datasets with lower CE have time-series values that more strictly follow the relations ℋ.)

Definition 2 (Strong and weak consistency).
A dataset D is strongly consistent if CE(D) = 0. Otherwise, D is said to be weakly consistent.
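As an illustration, the consistency error of Definition 1 can be sketched in a few lines. The parent-to-children dictionary encoding of ℋ and the toy node names below are our own conventions, and the normalization (average over parent-timestep pairs) is one reasonable choice:

```python
def consistency_error(series, children):
    """Average absolute deviation of each parent from the sum of its children.

    series: dict mapping node id -> list of values over time
    children: dict mapping parent id -> list of child ids
    """
    total, count = 0.0, 0
    for parent, kids in children.items():
        for t in range(len(series[parent])):
            aggregate = sum(series[k][t] for k in kids)
            total += abs(series[parent][t] - aggregate)
            count += 1
    return total / count

# A strongly consistent toy hierarchy: "US" = "East" + "West" at every step
strong = {"US": [5.0, 7.0], "East": [2.0, 3.0], "West": [3.0, 4.0]}
tree = {"US": ["East", "West"]}
```

A dataset built bottom-up (like Tourism-L, discussed later) scores exactly zero here; directly measured datasets typically do not.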
Let the current time-step be t. For any 1 ≤ t_1 < t_2 ≤ t, we denote y_i^{(t_1:t_2)} = [y_i^{(t_1)}, . . ., y_i^{(t_2)}]. Given D = {y_i^{(1:t)}}_{i=1}^N and hierarchical relations ℋ, a model M is trained to predict the marginal forecast distributions at time t + τ for all time-series of the hierarchy leveraging past values of all time-series: {p(y_i^{(t+τ)} | D)}_{i=1}^N. Along with the accuracy of probabilistic forecasts, we also evaluate forecast distributions for calibration. We define calibration of model forecasts based on previous works [14,18]:

Definition 3 (Calibration Score of a Model). Given a model M, we define a calibration function c_M : [0, 1] → [0, 1] as follows: given a confidence level k, c_M(k) is the fraction of the predictions for which the ground truth lies within the k-confidence interval. The calibration score CS(M) is the total deviation between k and c_M(k): CS(M) = ∫_0^1 |c_M(k) − k| dk. The goal is to generate accurate and well-calibrated forecast distributions {p(y_i^{(t+τ)} | D)}_{i=1}^N across all levels of the hierarchy for both weakly and strongly consistent datasets.
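For Gaussian forecasts, the calibration score of Definition 3 can be approximated numerically. The sketch below is illustrative: the central-interval convention and the Riemann step size are our assumptions (the experiments later use a 0.05 step):

```python
from statistics import NormalDist

def calibration_score(means, stds, truths, step=0.05):
    """Approximate CS(M) = integral over k in [0,1] of |c_M(k) - k|."""
    score = 0.0
    for i in range(1, round(1 / step)):     # k = step, 2*step, ..., 1 - step
        k = i * step
        inside = 0
        for mu, sd, y in zip(means, stds, truths):
            d = NormalDist(mu, sd)
            lo = d.inv_cdf((1 - k) / 2)     # central k-confidence interval
            hi = d.inv_cdf((1 + k) / 2)
            inside += lo <= y <= hi
        c_k = inside / len(truths)
        score += abs(c_k - k) * step
    return score
```

A perfectly calibrated model has c_M(k) ≈ k and hence a score near zero; a model whose intervals always (or never) cover the truth deviates strongly.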

Functional Neural Process for Base Forecasts
PROFHiT first derives base forecasts for all the nodes from any differentiable base forecasting model, so that we can use backpropagation on the loss function to update the parameters of the base forecasting model as well. Formally, the base forecasting model outputs the base forecast distribution parameters {μ_i, σ_i}_{i=1}^N from the input time-series of all nodes as P({μ_i, σ_i}_{i=1}^N | {y_i^{(1:t)}}_{i=1}^N). We leverage recent advances in Functional Neural Process [20] based non-parametric probabilistic sequential models that have provided state-of-the-art accurate and calibrated predictions in many domains [14,15]. These models capture the uncertainty of the input time-series as well as its correlation with time-series in the training data to provide calibrated forecast distributions. Specifically, we use a slightly modified version of the model proposed in [14] and denote it as TSFNP. The only difference between [14] and TSFNP is that instead of modeling correlations with past values of the same time-series, as in the univariate forecasting case, we model correlations with the past time-series of all nodes. We provide a detailed description of TSFNP in the Appendix. For our further discussion, we can view TSFNP as a stochastic model with latent variables: P({μ_i, σ_i}_{i=1}^N, Z | {y_i^{(1:t)}}_{i=1}^N), where Z denotes the full set of latent variables of TSFNP.

METHODOLOGY
Overview. PROFHiT models the forecast distributions of all time-series nodes of the hierarchy {p(y_i^{(t+τ)} | D)}_{i=1}^N by leveraging the relations from the hierarchy to provide accurate and well-calibrated forecasts that are adaptable to varying hierarchical consistency. Most existing methods do not attempt to model the entire probabilistic distribution but focus on the consistency of point forecasts, samples, or fixed quantiles of the distribution [11,24]. This approach does not fully capture the uncertainty of the forecasts and, in turn, does not provide calibrated predictions. Methods like PEMBU, MinT, and ERM are post-processing steps that can be applied to base forecasts from any model and provide theoretical guarantees. However, they do not allow the forecasting model to learn from relations across the hierarchy. Moreover, most methods assume that the datasets are strongly consistent over hierarchical relations. However, many real-world datasets are weakly consistent: the time-series values of all nodes of the hierarchy are observed simultaneously and may not follow the hierarchical relations strictly due to noise and discrepancies in collecting data at different levels. Therefore, most previous works may not adapt well to such deviations from these constraints. PROFHiT, on the other hand, reconciles the need to model consistency between entire forecast distributions as well as to induce a soft, adaptable constraint to enforce consistency via a two-step process that is trained in an end-to-end manner. The first component of PROFHiT is a differentiable neural probabilistic model such as TSFNP that produces a base forecast distribution for each node, parameterized by {(μ_i, σ_i)}_{i=1}^N. The base forecasts of all nodes are used as priors to derive a refined set of forecast distributions parameterized by {(μ̂_i, σ̂_i)}_{i=1}^N via the Hierarchy-aware Refinement Module described in Section 4.1. We introduce the novel Soft Distributional Consistency Regularization (SoftDisCoR) that enables PROFHiT to produce refined forecast distributions that are distributionally consistent with the hierarchical relations ℋ, as described in Section 4.2. The full probabilistic process of PROFHiT is depicted in Figure 2.

Hierarchy-aware Refinement Module
The base forecast distributions P({μ_i, σ_i} | D) produced by TSFNP (or any other model that can be used in its place) do not leverage the underlying hierarchical relations ℋ. This may lead to sub-optimal forecasting performance and inconsistent forecasts. The refinement module is a differentiable module that fuses the information from the base forecasts of all nodes to output refined forecast distributions that can leverage SoftDisCoR to be consistent.
Formally, given the parameters of base forecast distributions {μ_i, σ_i}_{i=1}^N derived from TSFNP for all time-series {y_i^{(1:t)}}_{i=1}^N, the refinement module derives the refined forecast distributions, denoted by parameters {μ̂_i, σ̂_i}_{i=1}^N, as functions of the parameters of the base forecasts of all time-series. Let μ = [μ_1, . . ., μ_N] and σ = [σ_1, . . ., σ_N] be the vectors of means and standard deviations of the base distributions. Since each node's refined distribution parameters depend on all N nodes' base forecast parameters, the refinement process must be efficient in fusing the information from all base forecasts for each node. Moreover, since we require PROFHiT to be adaptable to datasets of both strong and weak consistency, the refinement process should automatically learn to trade off between the influence of the base forecast distribution of each node and the fused information from all nodes. Considering these objectives, we derive the mean μ̂_i of the refined distribution as a weighted sum of two terms: a) μ_i, the base mean of the time-series, and b) a linear combination of the base means of all time-series:
μ̂_i = sigmoid(γ_i) μ_i + (1 − sigmoid(γ_i)) ŵ_i^⊤ μ,  (3)
where {γ_i}_{i=1}^N and {ŵ_i}_{i=1}^N are both learnable sets of parameters of the model and sigmoid(·) denotes the sigmoid function. The operations in Equation 3 have a total computational complexity of O(N) for each node and therefore O(N²) in total. This is on par with previous state-of-the-art end-to-end refinement methods like HierE2E [24] and more efficient than post-processing methods like MinT and ERM during inference. The learnable parameter γ_i allows the refinement module to trade off between the influence of the base distribution of node i and the influence of the other nodes of the hierarchy, making PROFHiT automatically adapt to datasets with varying hierarchical consistency.
Similarly, we assume the variance of the refined distribution depends on the base means and variances of all time-series. The standard-deviation parameter σ̂_i of the refined distribution is derived from the base distribution parameters μ and σ via a positive-valued transformation of a linear combination,
σ̂_i = ε + softplus(v_{1,i}^⊤ μ + v_{2,i}^⊤ σ + b_i),  (4)
where {v_{1,i}}_{i=1}^N, {v_{2,i}}_{i=1}^N and {b_i}_{i=1}^N are learnable parameters and ε is a positive constant hyperparameter that keeps σ̂_i strictly positive. Note that the complexity of Equation 4 is also O(N²).
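A minimal sketch of the refinement step in Equations 3 and 4, with plain-Python stand-ins for the learnable parameters (the gate logits gamma and the weight matrices W, V1, V2 are hypothetical placeholders for learned values, and softplus is one choice of positivity-enforcing map):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

def refine(mu, sigma, gamma, W, V1, V2, eps=1e-3):
    """Refine base (mu, sigma) of N nodes into (mu_hat, sigma_hat).

    gamma: per-node gate logits; W, V1, V2: N x N weights (hypothetical).
    """
    n = len(mu)
    mu_hat, sigma_hat = [], []
    for i in range(n):
        g = sigmoid(gamma[i])                        # per-node trade-off gate
        fused = sum(W[i][j] * mu[j] for j in range(n))
        mu_hat.append(g * mu[i] + (1 - g) * fused)   # Eq. 3
        lin = sum(V1[i][j] * mu[j] + V2[i][j] * sigma[j] for j in range(n))
        sigma_hat.append(eps + softplus(lin))        # Eq. 4, positive by construction
    return mu_hat, sigma_hat
```

Each node touches all N base parameters, which is the O(N²) total cost noted in the text; a large gate logit γ_i makes node i rely almost entirely on its own base forecast.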

Soft Distributional Consistency Regularization
While the refinement module helps aggregate information from base forecasts to refine the distribution parameters, we also need to design the loss function such that the parameters of the refinement module and TSFNP utilize the underlying hierarchical relations ℋ to provide hierarchically consistent forecast distributions by effectively utilizing information from all nodes of the hierarchy.
For the full distribution of the refined forecasts to be consistent, we use a Distributional Consistency Error (DCE) as part of the loss function and regularize the full distribution of all nodes.
The Distributional Consistency Error (DCE) is defined as follows:

Definition 4 (Distributional Consistency Error - DCE). Given the forecasts at time t + τ as {p_i(y_i^{(t+τ)} | D)}_{i=1}^N, the distributional consistency error is
DCE = Σ_p Dist( p_p, Σ_{c ∈ C(p)} p_c ),  (5)
where the sum ranges over all non-leaf nodes p, Σ_{c ∈ C(p)} p_c denotes the distribution of the sum of the children's forecasts, and Dist is a distributional distance metric.
Leveraging the distributional consistency error as a soft regularizer encourages forecast distributions to be well-calibrated while adaptively adhering to the hierarchical relations of the dataset.
For the distance metric Dist in Equation 5, we use the Jensen-Shannon Divergence [8] (JSD) since it is a symmetric and bounded variant of the popularly used KL-Divergence. Moreover, it assumes a closed form for many widely used distributions, including the Gaussian used in PROFHiT. While we can replace JSD with other distance measures for capturing distributional similarity, we observed that JSD was sufficient for providing good forecast performance in our applications. We derive the distributional consistency error on the Gaussian parameters {(μ̂_i, σ̂_i)}_{i=1}^N as
L_dist = Σ_p JSD( N(μ̂_p, σ̂_p²), N(Σ_{c ∈ C(p)} μ̂_c, Σ_{c ∈ C(p)} σ̂_c²) ).  (6)
The computation of JSD is generally intractable. However, in our case, due to the parameterization of each time-series distribution as a Gaussian, we get a closed-form differentiable expression (Equation 7).
Derivation of Distributional Consistency Error. To derive Equation 7, we use the well-known closed-form result for the JSD of two Gaussian distributions [21], given two univariate Normal distributions N(μ_1, σ_1²) and N(μ_2, σ_2²) (Equation 8). Consider each JSD term of the summation in Equation 6. Note that the sum of the independent children forecast distributions is itself Gaussian, with mean Σ_{c ∈ C(p)} μ̂_c (9) and variance Σ_{c ∈ C(p)} σ̂_c² (10). Using Equation 8 along with Equations 9 and 10, we get the desired result in Equation 7.
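The KL term between univariate Gaussians has a standard closed form, and one common way to obtain a closed-form JSD-like quantity is to approximate the two-component mixture by its moment-matched Gaussian. The sketch below uses that approximation for illustration; the paper's exact Equations 7-8 follow [21] and may differ in form:

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def jsd_gauss(m1, s1, m2, s2):
    """JSD with the 1/2-1/2 mixture replaced by its moment-matched Gaussian."""
    mm = 0.5 * (m1 + m2)
    vm = 0.5 * (s1 ** 2 + s2 ** 2) + 0.25 * (m1 - m2) ** 2
    sm = math.sqrt(vm)
    return 0.5 * kl_gauss(m1, s1, mm, sm) + 0.5 * kl_gauss(m2, s2, mm, sm)

# Parent vs. aggregated children: a sum of independent Gaussian children
# N(mu_c, s_c^2) is N(sum mu_c, sum s_c^2), as in Equations 9 and 10.
def parent_children_divergence(parent, children):
    mu_sum = sum(m for m, _ in children)
    var_sum = sum(s ** 2 for _, s in children)
    return jsd_gauss(parent[0], parent[1], mu_sum, math.sqrt(var_sum))
```

The divergence vanishes exactly when the parent's forecast matches the aggregated children's distribution, which is what each term of Equation 6 penalizes.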
We use the distributional consistency error as a soft regularization term to enable PROFHiT to leverage the constraints ℋ when generating forecast distributions. We do not make DCE a hard constraint since the model needs to adapt to datasets of varying consistency. In particular, for weakly consistent datasets, we do not require PROFHiT to adhere strictly to the hierarchical relations ℋ, which may result in sub-optimal forecast accuracy and calibration since the ground truth does not follow ℋ either. Therefore, using DCE as a soft regularizer allows the model to adapt to the varying strictness of ℋ across different domains.

Details on Training
Training loss. Along with the SoftDisCoR loss L_dist, which informs PROFHiT of the hierarchical relations ℋ and optimizes for distributional consistency, we derive the likelihood loss L_ll to optimize for the accuracy and calibration of the forecasts. Using TSFNP (or any other base forecasting model with latent variables) as the base model, the full probabilistic process of PROFHiT can be summarized as
P({μ̂_i, σ̂_i}_{i=1}^N | D) = ∫ P({μ̂_i, σ̂_i}_{i=1}^N | Z, D) P(Z | D) dZ.  (11)
Integrating over the latent variables Z in Equation 11 is highly intractable. Therefore, we use variational inference by approximating the posterior over the latent variables with q(Z | {y_i^{(1:t)}}_{i=1}^N) and derive an ELBO L_ll, which we use as the optimization objective. The details of the derivation of the ELBO loss are in the Appendix. Since the refinement module is a deterministic mapping from base to refined distribution parameters, the ELBO derivation is very similar to that in [14]. Therefore, our framework is flexible enough to adapt to a wide range of neural forecasting models with different learning algorithms.
Thus, the total loss for training is given as L = L_ll + λ L_dist, where the hyperparameter λ controls the trade-off in importance between data likelihood and distributional consistency. We also use the reparameterization trick to make the sampling process differentiable, and we learn the parameters of all training modules via Stochastic Variational Bayes [16]. The full pipeline of PROFHiT is summarized in Figure 2.
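The objective and the reparameterized sampling above can be sketched as follows (λ and the Gaussian form come from the text; the function names are ours):

```python
import random

def reparam_sample(mu, sigma, rng=random):
    """z = mu + sigma * eps: gradients flow through (mu, sigma), not the noise."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

def total_loss(likelihood_loss, dce, lam):
    """L = L_ll + lambda * L_dist: lambda trades likelihood vs. consistency."""
    return likelihood_loss + lam * dce
```

A small λ downweights the consistency penalty, which is the appropriate regime for weakly consistent datasets; a large λ pushes the refined distributions toward strict hierarchical agreement.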
Parameter sharing across nodes. Since PROFHiT's TSFNP module forecasts for multiple nodes, we leverage the hard-parameter-sharing paradigm of multi-task learning [3] and use a different set of parameters for the Predictive Distribution Decoder (i.e., the weights Θ_3) for each node, whereas the parameters of the other components of TSFNP are shared across all nodes (Figure 2). Sharing parameters for the Probabilistic Neural Encoder drastically lowers the number of learnable parameters, since datasets can have a large number of nodes (up to 512 nodes in our experiments).
Pre-training on individual time-series. Before we start training for refined forecasts, we pre-train the parameters of TSFNP on the given training dataset to model the base forecast distributions accurately. We pre-train using only the log-likelihood loss to learn the parameters {μ_i, σ_i}_{i=1}^N.

EXPERIMENTS
We evaluate PROFHiT over multiple datasets and compare it with state-of-the-art baselines.

Setup
Baselines: We compare PROFHiT's performance against state-of-the-art hierarchical forecasting methods. We also compare against state-of-the-art general probabilistic forecasting methods to study the importance of modeling the hierarchy for both weakly and strongly consistent datasets.
(1) TSFNP [14]: a neural forecasting model for accurate and calibrated forecasts described in Section 3.2. (2) DeepAR [26]: another state-of-the-art deep probabilistic forecasting model that does not exploit hierarchy relations. (3) MinT [30]: a post-processing method for reconciliation of base forecasts. (4) ERM [2]: another post-processing method like MinT that relaxes the unbiasedness assumption on base forecasts. (5) HierE2E [24]: a recent state-of-the-art deep-learning-based approach that projects the base predictions onto a space of consistent forecasts and trains the model in an end-to-end manner. (6) SHARQ [11]: another state-of-the-art deep-learning-based approach that reconciles forecast distributions by using quantile regression and making the quantile values consistent. (7) PEMBU [27]: a post-processing method that refines base forecasts to be distributionally consistent.
Note: In our experiments, we performed ERM and MinT on Monte Carlo samples of the TSFNP predictive distribution, since TSFNP provided better results compared to DeepAR. We also use the mean forecasts from MinT and ERM as input forecasts for PEMBU.
Datasets: We evaluate on a diverse set of publicly available datasets (Table 2) from different domains with varied hierarchical relations and consistency. The benchmarking datasets and evaluation setup are replicated from recent and past literature on general hierarchical forecasting as well as epidemic forecasting. Tourism-L, Labour, and Wiki are constructed by collecting values of the leaf nodes and deriving the values of the time-series of the other nodes of the hierarchy. Hence, they are strongly consistent with zero CE (Definition 1). The values of each node of the hierarchy in the case of Flu-Symptoms and FB-Survey are directly collected or measured. For example, the values of the Flu-Symptoms dataset are collected from public health agencies at the state, HHS, and national levels and aggregated by the CDC. Due to factors like reporting discrepancies and noise, they contain values that may deviate from the given hierarchical relations [4]. Therefore, these datasets are weakly consistent with significant CE (Table 2). We also provide level-wise consistency errors for all the datasets in the Appendix.
Evaluation metrics. We evaluate our model and the baselines using carefully chosen metrics that are widely used in the literature to measure accuracy and calibration. We also measure the distributional consistency of the output forecasts to study how well each model trades off accuracy and consistency for datasets of varying consistency errors. For a ground truth y^{(t)}, let the predicted probability distribution be p^{(t)} with mean ŷ^{(t)}, and let F^{(t)} be its CDF.
• Mean Absolute Percentage Error (MAPE) is a commonly used score for point predictions, calculated as the mean of |ŷ^{(t)} − y^{(t)}| / |y^{(t)}| over all predictions.
• Log Score (LS) is a standard score used to measure the accuracy of probabilistic forecasts in epidemiology [25]. LS measures the log-likelihood of a fixed-size interval around the ground truth under the predictive distribution. Similar to [25], the log-likelihood of a forecast is capped at -10. The calculation of LS is tractable due to the Gaussian assumption on the forecast distribution.
• Calibration Score (CS): To measure the calibration of forecasts, we use the calibration score defined in Section 3.1. We approximate the integral via a Riemann sum over [0, 1] with step size 0.05.
• Continuous Ranked Probability Score (CRPS) is a widely used standard metric for the evaluation of probabilistic forecasts that measures both accuracy and calibration. Given ground truth y and the predicted probability distribution p with CDF F, CRPS is defined as
CRPS(F, y) = ∫ ( F(z) − 1{z ≥ y} )² dz.
We approximate F as a Gaussian distribution fit to samples of the model to derive CRPS.
• Distributional Consistency Error (DCE): We calculate the Distributional Consistency Error (Equation 6) on output forecast distributions during inference to study how PROFHiT and the baselines learn from hierarchical relations across datasets of varying consistency and trade off consistency, calibration, and accuracy, especially for weakly consistent data (Section 5.2, Q3).
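The accuracy metrics above can be sketched under the Gaussian forecast assumption. The Log Score interval half-width is a hypothetical choice (the text only says "fixed-size"), the -10 cap is from the evaluation setup, and the Gaussian CRPS uses its well-known closed form:

```python
import math
from statistics import NormalDist

def mape(y_true, y_pred):
    """Mean absolute percentage error over point predictions."""
    return 100.0 * sum(abs(p - t) / abs(t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

def log_score(mu, sigma, y, half_width=0.5, floor=-10.0):
    """Log-likelihood of a fixed-size interval around y, capped at -10."""
    d = NormalDist(mu, sigma)
    p = d.cdf(y + half_width) - d.cdf(y - half_width)
    return max(math.log(p), floor) if p > 0 else floor

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2) against observation y."""
    std = NormalDist()                     # standard normal for Phi and phi
    z = (y - mu) / sigma
    return sigma * (z * (2 * std.cdf(z) - 1)
                    + 2 * std.pdf(z) - 1 / math.sqrt(math.pi))
```

Lower is better for MAPE and CRPS, while a higher (less negative) Log Score is better; CRPS rewards forecast distributions that are both sharp and well-centered.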

Results
We comprehensively evaluate PROFHiT through the following questions: Q1: Does PROFHiT predict accurate, calibrated forecasts? Q2: Does PROFHiT provide consistently better performance across all levels of the hierarchy? Q3: Does SoftDisCoR help PROFHiT outperform baselines on both strongly and weakly consistent datasets? Q4: What impact do various modeling choices have on the model's overall performance? Q5: How do improved calibration and forecast consistency help PROFHiT deal with missing values in data?
Accuracy and calibration performance (Q1). We evaluate the baselines, PROFHiT, and its variants on all datasets over 5 independent runs. The average scores across all levels of the hierarchy are shown in Table 3. PROFHiT significantly outperforms all baselines in MAPE by 13%. It also outperforms the baselines in LS and CS significantly in most cases. Finally, PROFHiT shows 41-88% better CRPS scores. Thus, PROFHiT adapts well to varied kinds of datasets and outperforms all baselines in both accuracy and calibration.
Performance across the hierarchy (Q2). Next, we look at the performance of all models across each level of the hierarchy. We compare the performance of PROFHiT with the best-performing baselines, HierE2E and SHARQ, on all datasets. PROFHiT significantly outperforms the best baselines both in terms of accuracy and calibration. The performance improvement is consistent across all levels of the hierarchy in most of the benchmarks. We show detailed results in the Appendix. This shows that the model effectively leverages hierarchical relations across all nodes to provide significantly more accurate and calibrated forecasts across the hierarchy.

Effect of SoftDisCoR on datasets of varying consistency (Q3).
Since most previous state-of-the-art models assume datasets to be strongly consistent, deviations from this assumption can cause under-performance on weakly consistent datasets. This is evidenced in Table 3, where most of the baselines explicitly optimize for hierarchical consistency as a hard constraint on the forecasts. For example, PEMBU's forecasts have better distributional consistency error (DCE) for weakly consistent datasets. However, they perform much worse in both accuracy and calibration than even TSFNP, which does not leverage hierarchical relations at all. Since we use SoftDisCoR as a soft learning constraint, PROFHiT can learn to trade off consistency for accuracy and calibration. Therefore, PROFHiT provides 93% better CRPS and significantly better calibration than the best baselines. These improvements are more pronounced at non-leaf nodes of the hierarchy, where PROFHiT's improvement is significantly larger for the weakly consistent Flu-Symptoms and FB-Survey datasets. In the case of strongly consistent datasets, PROFHiT provides 54% better CRPS and better calibration while having DCE comparable to PEMBU. We provide further analysis of these observations in the Appendix.
Ablation study on various modeling choices (Q4). We evaluate the efficacy and contribution of our various modeling choices, including the usefulness of SoftDisCoR, the refinement module, hard-parameter sharing, and using TSFNP as our model of choice for base forecasts. We perform an ablation study using the following variants of PROFHiT:
• P-NoConsistency: This variant is trained by completely removing SoftDisCoR from training. Note that, unlike P-FineTune (defined below), which is initially trained with SoftDisCoR before fine-tuning, P-NoConsistency never uses SoftDisCoR at any point in the training routine. Therefore, P-NoConsistency measures the importance of explicitly regularizing over the information from the hierarchy.
• P-NoRefine: We remove the hierarchical refinement module and optimize the base forecasts for both likelihood and SoftDisCoR loss.
• P-DeepAR: We evaluate our choice of TSFNP, a previous state-of-the-art univariate forecasting model for accurate and calibrated forecasts, by replacing it with DeepAR [26], another popular probabilistic forecasting model that was used by HierE2E.
• P-NoParamShare: We study the effect of our multi-task hard-parameter-sharing approach (Section 4.3) by training a variant where all the parameters of each TSFNP forecasting module are shared across all nodes.
• P-FineTune: We also look at the efficacy of our soft regularization using both losses, which adapts to optimize for both consistency and training accuracy, by comparing it with a variant where the predictive distribution decoder parameters are further fine-tuned for individual nodes using only the likelihood loss.
We compare the performance of PROFHiT with its variants in CRPS, CS, and DCE in Table 4 (the rest of the metrics are in the Appendix). We observe that PROFHiT is comparable to or better than the best-performing variant in most cases. The best-performing variant for strongly consistent datasets is P-NoParamShare, which is trained with both the likelihood loss and SoftDisCoR (Table 3). But its performance severely degrades for weakly consistent datasets, since sharing all model parameters across all time-series makes it inflexible to model patterns and deviations specific to individual nodes. In contrast, P-FineTune and P-NoConsistency perform the best among the variants for weakly consistent datasets, since they train separate sets of decoder parameters for each node. But they perform poorly for strongly consistent datasets since they do not leverage distributional consistency effectively. PROFHiT combines the flexible parameter learning of P-FineTune with the joint optimization under distributional consistency of P-NoParamShare, providing performance comparable to the best variant over all datasets.
Adapting to missing data (Q5). Accurate and well-calibrated models that effectively leverage knowledge of the hierarchy can intuitively better adapt to noise and missing data. Hence, we introduce the task of Hierarchical Forecasting with Missing Values and study how models adapt when values are missing from the time-series. We model a situation encountered in many real-world applications, such as epidemic forecasting, where the most recent values of a time-series are missing due to factors like data reporting delays [4].
Formally, at time-period t, we are given full data up to time t − τ. We set τ = 5 since it is the average forecast horizon of all datasets. For sequence values in the period between t − τ and t, we randomly remove k% of these values across all time-series. The models are trained on the complete time-series dataset up to time t′ = t − τ. The models' predictions are then used to fill in the missing values from t′ to t. Finally, we input the filled time-series to generate forecasts for future time-steps.
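The masking-and-fill protocol above can be sketched as follows. This is a hypothetical minimal implementation, not the paper's code; `predict_fn` stands in for whatever trained model performs the imputation.

```python
import numpy as np

def mask_recent_values(series, tau=5, k=0.10, rng=None):
    """Randomly hide a fraction k of the values in the last `tau` steps
    of each series (the HFMV masking step).

    series: (n_nodes, T) array of hierarchical time-series.
    Returns the masked copy (NaN marks missing) and a boolean mask.
    """
    rng = np.random.default_rng(rng)
    masked = series.astype(float).copy()
    hidden = np.zeros_like(masked, dtype=bool)
    recent = hidden[:, -tau:]                 # view into the last tau steps
    flat = rng.random(recent.size) < k        # hide ~k of the recent entries
    recent[...] = flat.reshape(recent.shape)
    masked[hidden] = np.nan
    return masked, hidden

def fill_with_model(masked, hidden, predict_fn):
    """Impute hidden entries with model predictions before forecasting."""
    filled = masked.copy()
    filled[hidden] = predict_fn(masked)[hidden]
    return filled
```

With `k = 1.0` every value in the last `tau` steps is hidden, which is a convenient sanity check of the protocol.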
We measure the relative decrease in performance of PROFHiT and the baselines as the percentage of missing data k increases (Figure 3). We observe that PROFHiT's performance decreases much more slowly with the fraction of missing values than that of the other baselines. Even at k = 10%, PROFHiT's performance decreases by only 10.45-26.8%, whereas the other baselines typically degrade by over 70%. Thus, PROFHiT effectively uses hierarchical relations to generate reliable predictions on both strongly and weakly consistent datasets.
We compare the relative performance decrease with increasing percentage of missing data for PROFHiT and its variants in Figure 4. We observe that P-NoConsistency's performance deteriorates very rapidly on most benchmarks, showing the importance of SoftDisCoR for learning calibrated and consistent forecasts. The second worst-performing variant across all datasets is P-FineTune, which also relies less on the hierarchical relations due to its fine-tuning of parameters for specific time-series. It is followed by P-NoRefine, which performs particularly poorly on strongly consistent datasets due to the absence of the refinement module that directly learns refined distributions by combining information from the base forecasts. Finally, we observe that PROFHiT and P-NoParamShare suffer the least degradation, since both models prioritize integrating hierarchical consistency information, which enables them to provide better estimates for the imputed missing inputs and use them to generate more accurate and calibrated forecasts.

CONCLUSION AND DISCUSSION
We introduced PROFHiT, a probabilistic hierarchical forecasting model that produces accurate and well-calibrated forecasts using soft distributional consistency regularization (SoftDisCoR). This enables PROFHiT to adapt to datasets with varying levels of hierarchical consistency. We evaluated PROFHiT against previous state-of-the-art hierarchical forecasting baselines over a wide variety of datasets and observed an average improvement of 41-88% in accuracy along with significantly better calibration scores.
PROFHiT provided the best performance across the entire hierarchy and significantly outperformed other models in producing robust predictions under missing data, where the other baselines' performance degraded by over 70%. We also showed the efficacy of various design choices of PROFHiT, including using TSFNP for generating raw forecasts, the multi-tasking approach of partial parameter sharing, the refinement module, and the novel distributional consistency loss used as a soft regularizer.
Our work opens new possibilities, such as extending to domains where time-series values across the hierarchy are not continuous real numbers, cannot be modeled as Gaussian distributions, or have different sampling rates. We can also explore modeling more complex structures between time-series with different aggregation relations. PROFHiT can further be used to study anomaly detection in time-series, especially in time-periods that deviate from the assumed consistency relations. Similar to Kamarthi et al. [15], we can extend our work to include multiple sources of features and modalities of data, both specific to each time-series and global to the entire hierarchy.

A DETAILS ON TSFNP
We briefly describe the components of TSFNP here and direct the reader to [14] for more details.
1) Probabilistic Neural Encoder: It models the temporal patterns of the input time-series and the uncertainty of its latent representation. It encodes the input univariate time-series into a latent stochastic embedding via a GRU [5] followed by a self-attention layer [28].
2) Stochastic Data Correlation Graph: We next model the correlations between the input time-series and the other time-series of the dataset to capture a contextual representation and the uncertainty of the input data point with respect to the training data distribution. These contextual representations are called the local latent variable. The only difference from [14] is that, whereas [14] uses past time-series information from the same node, in our multivariate case TSFNP uses past information from all nodes. Formally, for an input sequence y we sample from the past training sequences y_j, j ∈ {1, ..., M}, using the similarity between their latent stochastic embeddings {u_j}. For the input time-series of node i and each past training sequence y_j, we include y_j in the reference set A_i with probability exp(−‖u_i − u_j‖²). We then derive the local latent variable from A_i via two feed-forward networks Θ1 and Θ2.
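The similarity-based sampling step can be sketched as below. This is a simplified illustration assuming the embeddings are given as plain NumPy arrays; in TSFNP the embeddings themselves are stochastic, and the symbols u_i, u_j follow the (extraction-garbled) notation above.

```python
import numpy as np

def sample_reference_set(u_query, u_past, rng=None):
    """Include each past embedding u_j in the reference set with
    probability exp(-||u_query - u_j||^2), as in the Stochastic Data
    Correlation Graph. Returns the indices of the sampled sequences."""
    rng = np.random.default_rng(rng)
    d2 = np.sum((u_past - u_query) ** 2, axis=1)  # squared L2 distances
    probs = np.exp(-d2)                            # in (0, 1]
    keep = rng.random(len(u_past)) < probs         # independent Bernoulli draws
    return np.flatnonzero(keep)
```

Note that an identical embedding is sampled with probability 1, while distant embeddings are almost never included.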
3) Predictive Distribution Decoder: The final step of TSFNP's stochastic process combines the representations from the Probabilistic Neural Encoder and the Stochastic Data Correlation Graph, which capture the relevant sequential and contextual representations and the uncertainty of the input time-series. We combine the latent stochastic embedding, the local latent variable, and the combined information of all past sequences to derive the parameters of the output distribution via a simple feed-forward network. We first derive a global latent variable that combines the information from the local latent embeddings of all past sequences by applying a self-attention layer over {u_j} and summing its outputs: z = Self-Atten({u_j}).
Finally, we combine the latent embedding of the input time-series, the local latent variable, and the global latent variable to derive the base forecast distribution, modeled as a Gaussian N(μ, σ²), via a feed-forward network Θ3.
The full stochastic process of TSFNP can be summarized by composing the three components above. Note that in the main paper we denote the set of all latent variables ({u_j, z_j} for j = 1, ..., M, together with z) by a single collective symbol.
Note on running time. The novel component of PROFHiT is the hierarchy-aware refinement module that facilitates the integration of the base forecast distributions. As described in lines 398-403, the total computational complexity of obtaining the refined distributional parameters is O(n²), where n is the number of nodes in the hierarchy, which is comparable to the reconciliation step of end-to-end methods like HierE2E. Post-processing techniques such as MinT, ERM, and PEMBU have an even higher time complexity of O(n³).
Note that the other portion of the pipeline that may add to the time complexity is the base forecasting model. Models like the DeepAR and RNN used by HierE2E, SHARQ and the post-processing methods, as well as TSFNP (used by PROFHiT), scale linearly with the length of the time-series and linearly with the number of nodes n. Therefore all these baselines and PROFHiT use base forecasting models of similar time complexity with respect to the size of the hierarchy n.

B CODE AND DATASET
We evaluated all models on a system with an Intel 64-core Xeon processor, 128 GB of memory, and an Nvidia Tesla V100 GPU with 32 GB VRAM. We provide our implementation of PROFHiT along with the datasets used at https://github.com/AdityaLab/Profhit.

C HYPERPARAMETERS C.1 Data Preprocessing
Most datasets used in our work assume the aggregation function to be a simple summation (i.e., w_ij = 1 for all weights). We first normalize the leaf time-series of the training data to have zero mean and unit variance. Since aggregation at higher levels of the hierarchy can lead to very large values, we instead divide each non-leaf time-series by its number of children; the weights of the hierarchical relations then become w_ij = 1/|C_i|, where C_i is the set of children of time-series i. For the remaining datasets (Flu-Symptoms, FB-Symptoms) the time-series values are normalized by default and thus require no extra pre-processing.
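The preprocessing above can be sketched as follows, assuming the series and the hierarchy are given as plain dictionaries (a hypothetical minimal version; the paper's pipeline operates on its own dataset objects).

```python
import numpy as np

def preprocess(series, children):
    """Z-score leaf series using their own statistics, and divide each
    non-leaf series by its number of children, so the aggregation
    weights become 1/|C_i|.

    series: dict node -> 1-D array; children: dict node -> list of children.
    """
    out = {}
    for node, y in series.items():
        y = np.asarray(y, dtype=float)
        if children.get(node):                  # non-leaf: rescale by |C_i|
            out[node] = y / len(children[node])
        else:                                   # leaf: zero mean, unit variance
            out[node] = (y - y.mean()) / y.std()
    return out
```

Dividing a parent by its child count keeps aggregated series on the same scale as the normalized leaves.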

C.2 Model Architecture
The architecture of TSFNP used in PROFHiT is similar to that of the original implementation [14]. The GRU contains 60 hidden units and is bi-directional; thus the local latent variable is also of dimension 60. Θ1 and Θ2 are both 2-layer neural networks with the first layer shared between them, each layer having 60 hidden units. Finally, Θ3 is a three-layer neural network with an input layer of 180 units (for the concatenated input of three 60-dimensional vectors) and two subsequent layers of 60 hidden units each. We found that the model is not very sensitive to the value of the hyperparameter in Equation 4, which is usually set to 5.
Note that we do not explicitly model the covariance between every pair of time-series (unlike MinT and ERM) and instead use a weighted combination of base forecast parameters to derive the refined forecasts. Therefore the refinement module's complexity (Section 4.1) is O(n²), which is on par with previous methods like HierE2E. For training, we extract the prefix sequences {y^(t1:t2) : 1 ≤ t1 ≤ t2 < t − τ} and train the full model (TSFNP and refinement module). We tune the hyperparameters by backtesting, validating on the window from t − τ to t. Finally, we train on the entire training set with the best hyperparameters.

C.3 Training and Evaluation
For each benchmark, we used the validation set mainly to find the optimal batch size and learning rate. We searched over batch sizes of {10, 50, 100, 200}; the optimal learning rate was usually around 0.001. We also found the optimal regularization weight to be around 0.01 for strongly consistent datasets and 0.001 for weakly consistent datasets. We used early stopping with a patience of 150 epochs to prevent overfitting. For each independent run of a model, we initialized the random seeds from 0 to 5 for PyTorch and NumPy. We did not observe large variations due to randomness for PROFHiT or any of the baselines.
During evaluation, we drew 2000 Monte-Carlo samples from the forecast distribution and used them to estimate the mean for MAPE. We also used the sample mean and variance to evaluate LS and CS, whereas CRPS was evaluated directly from the samples via ensemble scoring using the properscoring package.
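A minimal sketch of the sample-based evaluation: the empirical CRPS below is the standard ensemble estimator, E|X − y| − 0.5·E|X − X′|, which is the quantity `properscoring.crps_ensemble` computes from Monte-Carlo samples (this re-implementation is for illustration only).

```python
import numpy as np

def crps_from_samples(y, samples):
    """Empirical (ensemble) CRPS of scalar observation y given
    Monte-Carlo forecast samples."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

def point_metrics(y, samples):
    """Sample mean/variance used for MAPE and the Gaussian-based scores
    (a sketch; the paper's LS and CS definitions are in the main text)."""
    mu, var = np.mean(samples), np.var(samples)
    mape = np.abs((y - mu) / y)
    return mu, var, mape
```

A degenerate ensemble that always equals the observation scores a CRPS of exactly zero, which is a useful sanity check.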
by canceling similar terms between the variational and true distribution of latent variables.

E CONSISTENCY OF DATASETS
We noted in Section 5.2 (Q4) that Flu-Symptoms and FB-Survey are weakly consistent datasets, since they do not strictly follow the aggregation relations, unlike the strongly consistent datasets Tourism-L, Labour, and Wiki. We empirically verify this by measuring the consistency errors of all datasets (Definition 1), both for the entire hierarchy and at each level of the hierarchy. The results are in Table 5. As expected, there are no deviations for the strongly consistent datasets, whereas there is significant deviation in the weakly consistent data.
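The consistency-error measurement can be sketched as the mean absolute deviation of each parent from its weighted child aggregate (a hypothetical minimal version of Definition 1; the exact definition is in the main paper):

```python
import numpy as np

def consistency_error(series, children, weights=None):
    """Mean absolute deviation |y_i - sum_j w_ij * y_j| over all
    non-leaf nodes i and all time steps. Zero exactly when the dataset
    is strongly consistent; weights default to 1 (simple summation)."""
    errs = []
    for node, kids in children.items():
        if not kids:
            continue  # leaf node: no aggregation constraint
        w = weights.get(node, [1.0] * len(kids)) if weights else [1.0] * len(kids)
        agg = sum(wj * np.asarray(series[k], float) for wj, k in zip(w, kids))
        errs.append(np.abs(np.asarray(series[node], float) - agg))
    return float(np.mean(errs)) if errs else 0.0
```

Perturbing any single parent value away from the sum of its children makes the error strictly positive.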

F PERFORMANCE ACROSS EACH LEVEL OF HIERARCHY
We compared the performance of PROFHiT with the best-performing baselines, HierE2E and SHARQ, at each level of the hierarchy of all datasets. PROFHiT significantly outperforms the best baselines as well as the variants. At the leaf nodes, which contain most of the data, PROFHiT outperforms the best baselines by 7% on Wiki up to 100% on FB-Survey. For the top node of the hierarchy the improvement is largest, ranging from 35% (Wiki) to 962% (FB-Survey). We show detailed results in Tables 6 and 7.

G DETAILS ON DATA IMPUTATION EXPERIMENT
Motivation: During real-time forecasting in real-world applications such as epidemic or sales forecasting, we encounter situations where the most recent values of the time-series are missing or unreliable for some of the nodes. This is observed especially at lower levels, due to discrepancies or delays in reporting and other factors [4]. Therefore, one approach to forecasting in such a situation is to first impute the missing values based on past data and then use the predicted values as part of the input for forecasting.
Task: To simulate such scenarios of missing data and evaluate the robustness of PROFHiT and all baselines, we design a task called Hierarchical Forecasting with Missing Values (HFMV). Formally, at time-period t, we are given full data up to time t − τ. We show results here for τ = 5, the average forecast horizon of all tasks. For sequence values in the time period between t − τ and t, we randomly remove k% of these values across all time-series. The goal of the HFMV task is to use the partial data from t − τ to t, together with the complete data before t − τ, to predict future values beyond t. Success in HFMV therefore implies that a model is robust to missing data from the recent past by effectively leveraging hierarchical relations.
Setup: We first train PROFHiT and the baselines on the complete dataset up to time t′ and then fill in the missing values of the input sequence using the trained model. Using the predicted missing values, we again forecast the output distribution. For each baseline and for PROFHiT, we perform multiple iterations of Monte-Carlo sampling of the missing values followed by forecasting of future values to generate the forecast distribution. We estimate the evaluation scores using the sample forecasts from all sampling iterations.

H ADAPTING TO VARYING DATASET CONSISTENCY
Observation 1. The average improvement in performance of PROFHiT over the best forecasting baselines is 72% higher for weakly consistent datasets than for strongly consistent datasets.
Since most previous state-of-the-art models assume datasets to be strongly consistent, deviations from this assumption can cause underperformance on weakly consistent datasets. This is evidenced in Table 3, where some baselines, like MinT and ERM, that explicitly optimize for hierarchical consistency perform worse on Flu-Symptoms and FB-Survey than even TSFNP, which does not leverage hierarchical relations at all. Overall, we found that for weakly consistent datasets PROFHiT provides a much larger 93% average improvement in CRPS over the best baselines, compared to a 54% average improvement for strongly consistent datasets. These improvements are more pronounced at the non-leaf nodes of the hierarchy, where PROFHiT improves by 2.8 times for Flu-Symptoms and 9.2 times for FB-Survey. This is because baselines that assume strong consistency do not adapt to noise at the leaf nodes, which compounds into errors at higher levels of the hierarchy.
Observation 2. PROFHiT's approach to parameter sharing and soft consistency regularization helps it adapt to varying hierarchical consistency.
We observe that the best-performing variant for strongly consistent datasets is P-NoParamShare, which is trained with both the likelihood loss and SoftDisCoR (Table 3). But its performance severely degrades on weakly consistent datasets, since sharing all model parameters across all time-series makes it inflexible to patterns and deviations specific to individual nodes. In contrast, P-FineTune and P-NoConsistency perform best among the variants on weakly consistent datasets, since they train separate sets of decoder parameters for each node; but they perform poorly on strongly consistent datasets because they do not leverage distributional consistency effectively. PROFHiT combines the flexible parameter learning of P-FineTune with the jointly optimized, consistency-regularized parameters of P-NoParamShare, providing performance comparable to the best variant on all datasets. The design choices of the refinement module further help PROFHiT adapt to datasets with different levels of hierarchical consistency. Specifically, by optimizing the per-node weights {λ_i} of Equation 3, PROFHiT learns a good trade-off between leveraging the prior forecast of a time-series and the hierarchical relations of the forecasts from the entire hierarchy. We study the learned values of {λ_i} used to derive the refined mean. Note that a higher value of λ_i indicates larger dependence on the base forecast of node i and smaller dependence on the forecasts of the rest of the hierarchy. We report the average value of λ_i for each dataset in Table 9. We observe that strongly consistent datasets have lower values of λ_i, indicating that PROFHiT's refinement module automatically learns to lean more strongly on the hierarchy for these datasets than for weakly consistent datasets.
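The trade-off discussed above can be sketched as a per-node convex combination. This is an illustrative sketch only: λ (lambda), mu_base, and mu_hier are stand-in names for the (extraction-garbled) symbols of Equation 3, and mu_hier abstracts whatever hierarchy-derived estimate the refinement module computes.

```python
import numpy as np

def refined_mean(lam, mu_base, mu_hier):
    """Refined mean per node: lambda_i near 1 trusts the node's own base
    forecast; lambda_i near 0 leans on the hierarchy-informed estimate."""
    lam = np.asarray(lam, dtype=float)
    return lam * np.asarray(mu_base, float) + (1.0 - lam) * np.asarray(mu_hier, float)
```

At the extremes this recovers the two pure strategies: λ_i = 1 reproduces the base forecast and λ_i = 0 reproduces the hierarchy-derived estimate, matching the interpretation of Table 9.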

Figure 1 :
Figure 1: PROFHiT learns to produce accurate and calibrated forecasts from datasets of varying consistency by leveraging underlying hierarchical relations via Soft Distributional Consistency Regularization.

... its value at time t. The time-series have a hierarchical relationship denoted as T = (V_T, E_T), where V_T is a tree of n nodes rooted at time-series 1. For a non-leaf node (time-series) i, we denote its children as C_i. The node values are related via the set of relations E_T = {y_i = Σ_{j ∈ C_i} w_ij y_j : ∀i ∈ {1, 2, ..., n}, |C_i| > 0}, where the weights w_ij are known, time-independent, real-valued constants.

Figure 2 :
Figure 2: Overview of the PROFHiT pipeline. The input time-series is ingested by TSFNP, a Functional Neural Process-based probabilistic forecasting model, to output the base forecast distribution. The parameters of the base forecasts are refined by the Hierarchy-aware Refinement module using predictions from all the time-series. Training is driven by a likelihood loss that learns from the ground truth and by Soft Distributional Consistency Regularization, which regularizes the forecast distribution to follow the hierarchical relations.

(1) Labour contains monthly employment data from Feb 1978 to Dec 2020 collected from the Australian Bureau of Statistics. (2) Tourism-L [30] contains tourism flows in different regions of Australia grouped by region and demographic. It has two sets of hierarchies (with four and five levels), one for the mode of travel and the other for geography, with the top node being the only node common to both hierarchies. (3) Wiki collects the number of daily views of 145,000 Wikipedia articles aggregated into 150 groups [27]. These 150 groups are the leaf nodes of a four-level hierarchy in which groups of similar topics are aggregated together. (4) Flu-Symptoms contains flu incidence values, called weighted influenza-like incidence (wILI) values [25], at multiple spatial scales for the USA for the period 2004-2020. The scales used are the state, HHS and national levels (the US states are grouped into 10 HHS regions by the CDC). (5) FB-Survey provides an aggregated, anonymized daily indicator of the prevalence of COVID-19 symptoms based on online surveys conducted on Facebook [7] from Dec 2020 to Aug 2021 for each state and the national level. We use the state-level values to compute aggregates at the HHS level.

Figure 3: % increase in CRPS for all models with increase in the proportion of missing data.
Figure 4: % increase in CRPS for PROFHiT and its variants with increase in the proportion of missing data.
Given the training dataset D, we extract the training dataset for each node as the set of prefix sequences {(y^(t1:t2), y^(t2+1)) : 1 ≤ t1 ≤ t2 < t − τ}. A perfectly calibrated model is one for which, for every confidence level c, the empirical coverage satisfies ĉ(c) ≈ c.

Table 2 :
Dataset Characteristics and Consistency

Table 3 :
Average scores (across 5 runs) across all levels of the hierarchy for all baselines, PROFHiT and its variants. PROFHiT provides top performance in terms of all evaluation metrics on most of the benchmarks.

Table 4 :
Average scores (across 5 runs) across all levels of the hierarchy for PROFHiT and its ablation variants. The best score is bolded and the second best is underlined.

Table 5 :
Average deviation of observed values in time-series from hierarchical relations.

Table 6 :
Average CRPS scores at each level of the hierarchy. PROFHiT significantly outperforms the best baselines across all benchmarks. Note that P-FineTune's performance decreases at higher levels of the hierarchy compared to the other variants, whereas P-Global's performance is worse across levels.

Table 7 :
Average CS scores at each level of the hierarchy. PROFHiT significantly outperforms the best baselines across all benchmarks.

Table 8 :
Std. dev. of CRPS and LS (across 5 runs) across all levels for all baselines, PROFHiT and its variants. PROFHiT performs significantly better than all baselines, as verified with a t-test at the 1% significance level.

Table 9 :
Average value of λ_i for all datasets. Note that weakly consistent datasets have higher λ_i (depending more on past data of the same time-series), whereas strongly consistent datasets have lower λ_i (leveraging the hierarchical relations more).