Big Data-Driven Portfolio Simplification: Leveraging Self-Labeled Clustering to Enhance Decision-Making

In the evolving landscape of business analytics practice, big data stands as a pivotal force, steering organizational strategies, particularly in portfolio management across end-to-end businesses. With the surge in data's volume, variety, veracity, and velocity, there is a pressing need for sophisticated computational methods to demystify intricate business portfolios, thereby facilitating astute decision-making. Traditional portfolio analysis techniques, although foundational, grapple with the challenges posed by expansive, multifaceted data and volatile market dynamics. To counter these challenges, our research pioneers an innovative approach, harnessing the power of clustering algorithms to refine and consolidate business portfolios. We employ big data techniques to analyze and categorize extensive portfolio datasets, unearthing inherent groupings and patterns. Leveraging clustering algorithms, we categorize business entities by similarity, yielding a streamlined and lucid portfolio blueprint. Our approach not only enhances the clarity of vast business portfolios but also strengthens strategic decision-making capabilities, propelling organizational nimbleness and market competitiveness. Through comparative analyses, our solution showcases significant advantages in portfolio simplification and decision-making efficacy over conventional techniques.


Introduction
The global business environment's complexity necessitates optimized operations for resilient supply chains. We introduce a novel approach aimed at simplifying materials, enhancing supply chain resilience, and improving business continuity strategies. Our goal is to streamline material varieties and advocate for industry-standard materials, enhancing our negotiation position, fostering a competitive supplier environment, and ultimately boosting organizational productivity [6].
Towards this goal, we propose a cutting-edge machine learning solution for material clustering, aiming to minimize intra-cluster distances and optimize material categorization [12]. This solution provides businesses with accurate, timely data and accessible, self-service machine learning tools, highlighting opportunities for material harmonization and strategic optimization [8].
Our approach involves a comprehensive data model that captures the global diversity of procured materials, offering deep insights for enhanced productivity. Collaborating with key industry partners, we refine our clustering methods, targeting significant financial savings, potentially surpassing hundreds of millions of dollars [9]. To ensure precision and adaptability, we implement a cohesive global strategy that integrates the expertise of experienced users for iterative model refinement.

Clustering Techniques
Clustering is a pivotal data mining technique that organizes data into coherent subsets based on feature similarities, addressing various challenges across diverse domains. We categorize clustering techniques into hierarchical and partitional; partitional methods typically require prior knowledge of the cluster count, a common hurdle when dealing with real-world datasets [2,10]. This paper explores both traditional and automatic clustering methods, with a focus on material clustering in supply chains. Commonly used clustering techniques include:
• Partitional Clustering: Techniques such as k-means partition data into non-overlapping subsets, requiring a predefined cluster count [1].
• Density-based Clustering: Examples include DBSCAN, which groups densely packed data points and treats sparse points as noise.
• Grid-based Clustering: Treating the data space as a grid, these methods offer faster clustering whose cost is largely independent of the number of data objects. Examples include STING and CLIQUE.
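For concreteness, the minimal sketch below contrasts a partitional and a density-based method on synthetic data; the dataset, parameter values, and the use of scikit-learn are illustrative choices, not the paper's configuration.

```python
# A minimal sketch contrasting partitional (k-means) and density-based
# (DBSCAN) clustering on synthetic blob data; all parameters are illustrative.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Partitional: k-means needs the cluster count k up front.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Density-based: DBSCAN infers the number of clusters from point density.
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print("k-means clusters:", np.unique(kmeans_labels).size)
print("DBSCAN clusters (excluding noise):",
      np.unique(dbscan_labels[dbscan_labels >= 0]).size)
```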

Big Data and ETL Technologies
In response to the big data revolution, we propose a novel extract, transform, and load (ETL) and MapReduce/Hadoop approach, focusing on signal components for data processing [3,7]. The approach starts with automated ETL dataset generation, followed by training and validation, laying the groundwork for an efficient ensemble model. The highlights of the proposed approach include:
• Novel exploration of ensemble learning with big data platforms for large-scale datasets in supply chain management.
• Incorporation of diverse category inputs for machine learning labels and categorization in the ensemble model.
• Empirical validation of our ensemble learning model, showcasing superior precision over singular models.
Sections to follow detail these aspects, covering literature review, predictive architecture, intelligent integration, data processing, ensemble learning, experimental results, and conclusions, guiding future research directions.

Related Work

A Renaissance of Traditional Clustering Through Deep Learning
Clustering stands as a critical technique in data mining and practical applications [5]. Amidst the subtle hum of computer servers in a dimly-lit room, a groundbreaking revelation echoed through the halls of the 2019 Neural Information Processing Systems (NeurIPS) conference. A novel innovation, aptly named DeepCluster, marked the harmonious union of venerable clustering techniques with the avant-garde realm of deep learning.
DeepCluster introduces a visionary strategy, intertwining the potent capabilities of CNNs with the systematic precision of k-means clustering [4]. k-means, renowned for its ubiquity and effectiveness in the clustering domain [11], plays a central role in this methodology. The DeepCluster paradigm is rooted in a seemingly straightforward, yet profoundly transformative idea: employ CNNs, such as the esteemed VGG16, to transmute complex, high-dimensional data into a realm of greater clarity and manageability. This transformation renders the subsequent clustering process not just feasible, but markedly more accurate.
The DeepCluster approach unfolds through meticulous steps:
• Purification and Augmentation: Prior to in-depth analysis, the raw data undergoes cleansing, normalization, and augmentation, ensuring both quality and diversity.
• Feature Extraction: A CNN such as VGG16 maps the prepared data into a lower-dimensional embedding space.
• Clustering: k-means partitions the embeddings, and the resulting assignments serve as pseudo-labels for retraining the network.
The method alternates between the second and third steps, culminating in a result that approaches perfection. As the NeurIPS conference drew to a close, an air of excitement and anticipation pervaded the space. DeepCluster had transcended the status of a mere methodology: it stands as a profound testament to the synergistic potential of integrating the wisdom of traditional methods with the ingenuity of contemporary advancements. In an era marked by rapid technological progress, DeepCluster exemplifies how a retrospective integration of the old and the new can catalyze groundbreaking innovations.
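As a rough illustration, the sketch below mimics the alternating embed/cluster/retrain loop under heavily simplified assumptions: a tiny stand-in ConvNet instead of VGG16, random tensors instead of images, and scikit-learn's KMeans. None of these choices reflect the original experimental setup.

```python
# A simplified sketch of a DeepCluster-style loop: a CNN embeds the data,
# k-means assigns pseudo-labels, and the network is retrained on them.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class TinyEncoder(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.classifier = nn.Linear(16 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyEncoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(256, 3, 32, 32)          # stand-in for real images

for epoch in range(3):
    # Step 2: embed the data with the current network.
    with torch.no_grad():
        embeddings = model.features(images).numpy()
    # Step 3: cluster the embeddings; assignments become pseudo-labels.
    pseudo_labels = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings)
    # Retrain the network on the pseudo-labels, then repeat.
    logits = model(images)
    loss = nn.functional.cross_entropy(logits, torch.tensor(pseudo_labels))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```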

Semi-Supervised Adaptive Clustering with Neural Network Embeddings
The year was 2020, and the setting was the illustrious International Conference on Machine Learning (ICML). In the grand halls, filled with a symphony of whispered conversations and palpable excitement, a revolutionary concept took center stage: Semi-Supervised Adaptive Clustering with Neural Network Embeddings. This is the crux of the study presented. By introducing a beacon, a tiny fragment of labeled data, into the vast expanse of the unsupervised dataset, clustering could be refined and guided, much like a compass navigating through the wilderness. The journey is meticulously charted:
• Data Split: Before embarking on the expedition, the data is segregated, with a small portion labeled and the rest left in its natural, unlabeled state.
• Neural Compass: The data is then transformed, or as the researchers termed it, "embedded" into a simpler realm using neural networks. This realm, being lower-dimensional, is easier to traverse and understand.
• Spectral Path: In this transformed space, the spectral clustering method acts as the primary explorer, attempting to categorize the data.
• Guidance from the Beacon: As the data is clustered, the beacon, the labeled segment, provides feedback, refining and rectifying the formed clusters.
The results, to the joy and awe of attendees, showcased the sheer potency of this innovative method. As attendees exited the grandeur of the ICML hall, they left with a profound lesson: in the world of data, a small beacon of guidance can sometimes illuminate the most intricate paths. And thus, the narrative of Semi-Supervised Adaptive Clustering found its place in the annals of machine learning lore.
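A hedged sketch of this beacon-guided idea follows; it substitutes a PCA projection for the neural-network embedding and uses the Iris dataset with a roughly 10% labeled split, all purely for illustration.

```python
# A sketch of semi-supervised guidance: spectral clustering on embedded
# data, with a small labeled "beacon" mapping clusters to known classes
# by majority vote. Dataset, embedding, and split are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralClustering

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=15, replace=False)   # ~10% labeled beacon

# "Neural compass" stand-in: a simple PCA embedding; the paper's
# neural-network embedding would replace this step.
Z = PCA(n_components=2).fit_transform(X)

clusters = SpectralClustering(n_clusters=3, random_state=0).fit_predict(Z)

# Guidance from the beacon: name each cluster after the majority true
# label among its labeled members.
mapping = {}
for c in np.unique(clusters):
    members = labeled_idx[clusters[labeled_idx] == c]
    if members.size:
        vals, counts = np.unique(y[members], return_counts=True)
        mapping[c] = vals[np.argmax(counts)]
refined = np.array([mapping.get(c, -1) for c in clusters])
print("agreement with ground truth:", np.mean(refined == y))
```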

ETL Process and Framework Deployment
Leveraging clustering techniques, our initiative meticulously analyzes a comprehensive business portfolio encompassing 14,000 materials. The primary goal is to align with prevailing market standards and strategically diminish the material count. This data-driven approach significantly amplifies buying power, bolsters productivity, ensures substantial cost savings, and carves out a competitive niche with an unwavering focus on quality.
Integrating clustering into our operational framework unlocks access to nuanced insights, particularly crucial during raw material shortages. This ensures an uninterrupted production chain while the machine learning model's continuous learning capability systematically uncovers opportunities for harmonization and optimization, fostering a proactive stance in the volatile market.
At the core of our ETL mechanism are two pivotal elements: (1) an extensive dataset derived from daily business interactions, and (2) a well-structured hierarchy of business domains culminating in a comprehensive time-series analytical table. Our approach adopts a refined MapReduce schema, utilizing a network of virtual machine instances dedicated to parallel processing, aiming to enhance operational efficiency and truncate preprocessing durations, as depicted in Table 1.
Delving deeper, our MapReduce paradigm operates through a dual architecture of Mappers and Reducers, facilitating processes from data sanitization to sophisticated data amalgamation and partitioning. Given the voluminous nature of our data, parallel processing becomes imperative. Facing the complexities of intricate business landscapes, a single MapReduce tier often falls short. Hence, by employing a multi-tiered data processing strategy and synchronizing mapper/reducer operations, we navigate these complexities at scale.

Data Harmonization
Data harmonization is crucial in data science, ensuring data coherence and applicability. Our harmonization pipeline starts with selecting relevant attributes based on domain expertise, aligning with our research goals.
In the preprocessing phase, we refine the raw data, employing methods tailored to different data types to enhance data quality and integrity. Common procedures such as imputation, outlier correction, and null-value treatment are applied to all data types, followed by specialized treatments:
• Numerical Data: Apply scaling for consistency and enhanced analysis precision.
• Categorical Data: Transform using one-hot encoding, quantifying categories without introducing ordinal relationships.
• Textual Data: Remove stop-words and special characters, then tokenize and vectorize; feature vectors are extracted using TF-IDF.
In post-preprocessing, data from different categories are integrated, forming composite data streams for model ingestion. We explore various data combinations for model optimization, laying a robust foundation for model training and evaluation.
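As an illustration, a minimal preprocessing sketch follows, using scikit-learn and pandas; the column names and the toy data frame are hypothetical, not drawn from the portfolio dataset.

```python
# A minimal sketch of the per-type preprocessing described above, via a
# scikit-learn ColumnTransformer; columns and values are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "weight_kg": [1.2, 3.4, None, 2.2],
    "supplier_region": ["EU", "APAC", "EU", "NA"],
    "description": ["steel bolt m6", "copper wire 2mm",
                    "steel bolt m8", "rubber gasket"],
})

preprocess = ColumnTransformer([
    # Numerical: impute, then scale for consistency.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["weight_kg"]),
    # Categorical: one-hot encode without implying any ordering.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["supplier_region"]),
    # Textual: TF-IDF feature vectors with stop-word removal.
    ("txt", TfidfVectorizer(stop_words="english"), "description"),
])

X = preprocess.fit_transform(df)
print(X.shape)   # composite feature matrix ready for model ingestion
```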

Implementation of the Proposed Algorithm
The utilization of MapReduce not only facilitates the preparation of input data but also enables the resolution of real-world problems at scale. In this work, we design the PA2LMR algorithm by incorporating synchronization: it transforms the vast amount of portfolio data into constructed data, category by category, and employs a second-layer MapReduce to perform classification and generate the data subsets on which the associated clustering algorithms run. The specific steps of the algorithm are outlined in Algorithm 1.
In this algorithm, Data PD represents all portfolio data keyed by a Category Identifier (CI), which combines the material category, IDs, and location IDs. Each data entry contains all associated attributes along with corresponding historical quantities. The first-layer MapReduce sorts the data by category and transforms it into a unified format for each category, classifying requirements for each location. The second-layer MapReduce applies a classification algorithm within each category, sorting the attributes into different data categories, and subsequently performs clustering algorithms based on these classifications.
Initially, portfolio data is extracted from the system using ETL queries. Lines 1 through 4 illustrate the first-layer mapper, which sorts and shuffles all portfolio data to the respective categories. Lines 5 through 9 correspond to the first-layer reducer, which transforms the data into a unified format by formatting them into constructed data. Lines 10 through 13 represent the second-layer mapper process, responsible for identifying the classification and shuffling the data accordingly. Finally, lines 14 through 18 depict the second-layer reducer process, which employs the clustering ensemble models to generate cluster results.
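The sketch below is a single-process, pure-Python analogue of this two-layer flow, intended only to show the data movement; the record fields and the toy quantity-based classifier are hypothetical stand-ins for the paper's CI structure and classification step.

```python
# A pure-Python analogue of the two-layer map/shuffle/reduce flow that
# PA2LMR describes; records and the toy classifier are illustrative.
from collections import defaultdict

records = [
    {"ci": "steel|plant-A", "attrs": [7.1, 0.3], "qty": 120},
    {"ci": "steel|plant-B", "attrs": [7.0, 0.4], "qty": 80},
    {"ci": "copper|plant-A", "attrs": [8.9, 0.1], "qty": 40},
]

# Layer 1 mapper: key each record by its material category.
def map1(rec):
    category = rec["ci"].split("|")[0]
    yield category, rec

# Layer 1 reducer: unify the records of one category into constructed data.
def reduce1(category, recs):
    return category, [{"x": r["attrs"], "qty": r["qty"]} for r in recs]

# Layer 2 mapper: classify each constructed record into a data subset.
def map2(category, constructed):
    for row in constructed:
        subset = "high" if row["qty"] > 100 else "low"   # toy classifier
        yield (category, subset), row

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Layer 2 reducer would run the clustering ensemble on each subset.
layer1 = shuffle(kv for r in records for kv in map1(r))
constructed = dict(reduce1(c, rs) for c, rs in layer1.items())
layer2 = shuffle(kv for c, rows in constructed.items() for kv in map2(c, rows))
for key, rows in layer2.items():
    print(key, "->", len(rows), "records for clustering")
```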

Data Collection and Experiment Environment
The experimental data remain confidential to preserve the anonymity of the private business collection. The portfolio data is collected as indicated in Table 1. The experiment was conducted on a cluster of 4 nodes running on Google Cloud Platform VMs with E2 instances and 128 GB of memory. The configuration is provided in Table 2.

Self-Labeled Clustering Algorithms to Enhance Decision-Making
In the domain of unsupervised machine learning, self-labeled clustering algorithms emerge as sophisticated techniques, synergizing the merits of both clustering and self-training paradigms. Initially, these algorithms delineate data into distinct clusters without the benefit of pre-existing labels, mirroring conventional clustering methodologies. In the ensuing phase, they intuitively label the clustered entities by pinpointing data points that exemplify high degrees of representativeness or confidence within each cluster. These inferred labels subsequently inform and refine the clustering process or assist in deeper analytical pursuits. The salient feature of this methodology is its incremental refinement of cluster quality through self-labeling, rather than sole reliance on preliminary clustering results. This progressive refinement is pivotal in augmenting decision-making, especially when it engenders sharper data categorizations in contexts where labeled data remains elusive or entirely absent.
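A minimal sketch of this self-labeling loop follows: cluster without labels, promote the most central points of each cluster to high-confidence pseudo-labels, and refit from those exemplars. The 20-point exemplar budget and the KMeans re-initialization are illustrative choices, not the paper's exact procedure.

```python
# Self-labeled clustering sketch: confident (most central) points become
# pseudo-labeled exemplars that anchor a refined clustering run.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Self-labeling: for each cluster, take the 20 most central members as
# high-confidence pseudo-labeled exemplars.
seed_centers = []
for k in range(3):
    idx = np.where(km.labels_ == k)[0]
    top = idx[np.argsort(dist[idx])[:20]]
    seed_centers.append(X[top].mean(axis=0))

# Refinement: re-run clustering initialized from the confident exemplars.
refined = KMeans(n_clusters=3, init=np.array(seed_centers), n_init=1).fit(X)
print("centroid shift:",
      np.linalg.norm(refined.cluster_centers_ - km.cluster_centers_))
```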
Given the absence of a target variable for prediction or classification, this research delves into unsupervised clustering techniques to delineate spec groups for raw materials. The objective is to discern clusters that spotlight similarities among the raw materials. The clustering strategies in focus encompass Hierarchical Clustering, K-Medoids, PCA + Hierarchical Clustering, FAMD + Hierarchical Clustering, and K-Prototypes. Feedback from Subject Matter Experts (SMEs) informs the selection and validation of these clusters.

Ensemble Clustering
4.1.1 K-Medoids Clustering: At the heart of the K-Medoids algorithm is the medoid, a representative point within a cluster that minimizes the sum of dissimilarities with all other points in the same cluster. Mathematically, the dissimilarity between an object $o_i$ and the medoid $m_c$ of its cluster is articulated as $d(o_i, m_c) = |o_i - m_c|$.

The exploration of diverse unsupervised clustering methodologies offers insightful avenues for raw material grouping. By harnessing the collective strengths of these techniques and incorporating expert feedback, it is conceivable to achieve nuanced, precise, and actionable cluster formations.
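To make the medoid criterion concrete, the short numpy sketch below finds the medoid of one synthetic cluster by minimizing the summed Manhattan dissimilarity; the data is illustrative.

```python
# Medoid selection sketch: the medoid of a cluster is the member that
# minimizes the summed absolute dissimilarity to all other members.
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 4))             # points of one cluster

# Pairwise dissimilarities d(o_i, o_j) = |o_i - o_j| summed over features.
D = np.abs(cluster[:, None, :] - cluster[None, :, :]).sum(axis=2)
medoid_idx = D.sum(axis=1).argmin()            # minimizes total dissimilarity
print("medoid:", cluster[medoid_idx])
```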

Integrating Self-Labeled Weights in Clustering Algorithms
Navigating through the complexities of clustering with diverse data types calls for innovative strategies. We propose an ensemble of various foundational clustering models, selectively enhancing their performance using self-labeled weights for attributes. These weights, derived from domain knowledge, elevate the impact of precisely labeled attributes on the ensemble's performance.
This approach mitigates the shortcomings of standalone models, leveraging the synergy between them to bolster overall predictive accuracy. We design the SLWCluster algorithm to exemplify this strategy, aiming to refine the ensemble's predictive capabilities. Details of this algorithm are provided in Algorithm 2.

Proposed Algorithm
In this algorithm, the list of models utilized for ensemble learning is denoted as ModelSet, while ModelList represents the list of all basic models used by the method. It is important to note that ModelSet is a subset of ModelList (ModelSet ⊆ ModelList), pre-selected for different attribute categories.
Firstly, a pre-selected method is employed to select a model combination from the ModelList, initiating the stacking integration training process.
Lines 1 to 11 depict the training of all models in the first layer using convergence iteration, resulting in the generation of predicted groups ($P_m$) for each model ($m$), as indicated in the output.
Lines 12 to 14 demonstrate the second layer of the stacking integration process. During this stage, the predicted groups from the first-layer models are used as features, and the final ensemble model is trained by combining these features with the initial ones. Lines 15 to 18 outline the self-label weights: the output groups are compared against the labels to adjust attribute weights. For each model $m \in M$, if the groups of the currently active model combination are close to the labels of the business inputs, the weights on attributes close to those labels are increased and the weight of each model is updated accordingly. Ultimately, the algorithm outputs the best clustering result.
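The sketch below illustrates one plausible reading of this weight-update idea: attribute weights are tentatively boosted and kept only when the resulting groups align better with the labels. The multiplicative update, the 1.5 boost factor, and the greedy search are illustrative assumptions, not the paper's Algorithm 2.

```python
# A hedged sketch of self-label weight adjustment: attribute weights grow
# when the weighted clustering agrees better with business-provided labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def weighted_cluster_score(X, weights, labels, k=3, seed=0):
    # Scale attributes by their current weights before clustering.
    groups = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(X * weights)
    return groups, adjusted_rand_score(labels, groups)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] += np.repeat([0, 4, 8], [70, 70, 60])      # attribute 0 carries signal
labels = np.repeat([0, 1, 2], [70, 70, 60])        # business input labels

weights = np.ones(5)
best = -1.0
for step in range(5):
    for j in range(5):
        trial = weights.copy()
        trial[j] *= 1.5                            # tentatively boost attribute j
        _, score = weighted_cluster_score(X, trial, labels)
        if score > best:                           # keep boosts that move groups
            best, weights = score, trial           # closer to the labels
print("learned weights:", np.round(weights, 2), "ARI:", round(best, 3))
```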

Evaluation Metrics
In this paper, we employ external evaluation metrics to assess the performance of the clustering models. These metrics include the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Fowlkes-Mallows Index (FMI), Jaccard Index, Purity, and Completeness and Homogeneity. Given a true clustering $U$ and an estimated clustering $V$, they are defined as follows:

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\big/\binom{n}{2}}$$

$$\mathrm{NMI}(U,V) = \frac{I(U,V)}{\sqrt{H(U)\,H(V)}} \qquad \mathrm{FMI} = \frac{TP}{\sqrt{(TP+FP)(TP+FN)}}$$

$$\mathrm{Jaccard} = \frac{TP}{TP+FP+FN} \qquad \mathrm{Purity} = \frac{1}{N}\sum_{k}\max_{j}\,|C_k \cap V_j|$$

where $n$ denotes the total number of objects, $n_{ij}$ denotes the number of objects in common between cluster $U_i$ of the true clustering and cluster $V_j$ of the estimated clustering, $a_i$ denotes the number of objects in cluster $U_i$ of the true clustering, $b_j$ denotes the number of objects in cluster $V_j$ of the estimated clustering, $I(U,V)$ denotes the mutual information between $U$ and $V$, $H(U)$ denotes the entropy of $U$, $H(V)$ denotes the entropy of $V$, $TP$ denotes the true positives, $FP$ the false positives, $FN$ the false negatives, $N$ the total number of samples, and $C_k$ the $k$-th class. Homogeneity and completeness are the entropy-based conditional measures $1 - H(U \mid V)/H(U)$ and $1 - H(V \mid U)/H(V)$, respectively.
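These external metrics are available off the shelf; the sketch below computes them with scikit-learn on two toy label vectors. Pair-based Jaccard and purity are derived from the pair-confusion and contingency matrices, since scikit-learn does not expose them directly for clusterings.

```python
# Computing the listed external metrics; the label vectors are toy
# stand-ins for the true (U) and estimated (V) clusterings.
import numpy as np
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             fowlkes_mallows_score, homogeneity_score,
                             completeness_score)
from sklearn.metrics.cluster import pair_confusion_matrix, contingency_matrix

U = [0, 0, 1, 1, 2, 2]    # true clustering
V = [0, 0, 1, 2, 2, 2]    # estimated clustering

print("ARI :", adjusted_rand_score(U, V))
print("NMI :", normalized_mutual_info_score(U, V))
print("FMI :", fowlkes_mallows_score(U, V))
print("Homogeneity :", homogeneity_score(U, V))
print("Completeness:", completeness_score(U, V))

# Pair-based Jaccard index: TP / (TP + FP + FN) over pairs of objects.
(tn, fp), (fn, tp) = pair_confusion_matrix(U, V)
print("Jaccard:", tp / (tp + fp + fn))

# Purity: fraction of objects falling in the majority class of their cluster.
cm = contingency_matrix(U, V)
print("Purity :", cm.max(axis=0).sum() / np.sum(cm))
```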

Optimizing Clustering with ARI and K

The Adjusted Rand Index (ARI) reveals the optimal alignment between the ground truth and K-Means clusters at K = 5, aligning with the data's origin from 5 Gaussian distributions. Other metrics, including the Silhouette Score, Davies-Bouldin Index, and Dunn Index, concur, highlighting K = 5 as the most balanced choice for cluster compactness and separation. Both external and internal metrics unanimously pinpoint K = 5 as the optimal cluster number for the dataset in K-Means, ensuring consistency in results.
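A sketch of this K-selection scan follows, on synthetic data drawn from 5 Gaussian blobs; silhouette stands in for the internal metrics and ARI for the external one (Davies-Bouldin and Dunn are omitted for brevity).

```python
# Scanning candidate K values with an internal metric (silhouette) and,
# where ground truth exists, an external metric (ARI). Data is synthetic.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=5, random_state=7)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    print(f"K={k}  silhouette={silhouette_score(X, labels):.3f}"
          f"  ARI={adjusted_rand_score(y_true, labels):.3f}")
# Both metrics should peak at K=5, matching the 5 generating distributions.
```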

Implementation and Performance Evaluation
We leverage a substantial transaction data volume and business domain layers for the ETL process, culminating in an analytical base table. Utilizing the MapReduce schema for parallel computing across virtual machine clusters, we aim to expedite data preprocessing and transformation.

Proposed Algorithm
The algorithm employs ModelSet for ensemble models and ModelList for all potential models, where ModelSet ⊆ ModelList. The optimal model combination is termed BestMSet, and its corresponding peak score BestMScore.
Initiation involves selecting a model combination from ModelList for stacking integration training (Lines 1-11), generating predicted groups $P_m$ for each model $m$. The second layer entails using these predicted groups as features for the final ensemble model, integrated with the original features (Lines 12-15).
Lines 16-18 depict the self-label adjustment phase, enhancing attribute weights when predicted groups closely match true labels.The algorithm concludes by outputting the optimal clustering center for the remaining groups.

Data Collection and Experiment Environment
The experimental data remain confidential to preserve the anonymity of the private business collection. The experiment was conducted on a cluster of 10 nodes running on cloud VMs with E2 instances and 128 GB of memory.

5.2.2 Running Time and Performance Evaluation

This MapReduce approach, employed as a data preprocessing strategy, serves to reduce model complexity, harness the power of parallel computing, and eliminate unnecessary attributes. The overall processing time, from input data to output generation, is reduced to 1-2 minutes for the real-time PyDash App, including all data preprocessing, clustering, and visualization, as stated in Table 3.

Conclusion and Future Work
Our innovative approach combines explainable AI with self-label clustering, employing SHAP values to provide clear and well-founded interpretations of cluster assignments. This methodology enhances transparency and enriches our understanding of the dataset's underlying patterns, revealing nuances often missed by conventional clustering techniques and enabling more informed decision-making.
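A minimal sketch of this explainability pattern follows: cluster assignments serve as labels for a surrogate classifier, and SHAP values on that classifier indicate which attributes drive each assignment. The surrogate model choice and synthetic data are illustrative assumptions; the paper's pipeline may differ.

```python
# Explaining cluster assignments via SHAP on a surrogate classifier
# trained to reproduce the clusters; model and data are illustrative.
import numpy as np
import shap
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X, _ = make_blobs(n_samples=400, centers=3, n_features=5, random_state=3)
clusters = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# Surrogate classifier learns to reproduce the cluster assignments.
surrogate = RandomForestClassifier(n_estimators=100,
                                   random_state=3).fit(X, clusters)
shap_values = shap.TreeExplainer(surrogate).shap_values(X)

# Older SHAP versions return one array per class; newer return a 3-D array.
vals = shap_values[0] if isinstance(shap_values, list) else shap_values[:, :, 0]
print("mean |SHAP| per attribute for cluster 0:",
      np.round(np.abs(vals).mean(axis=0), 3))
```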
While our present approach has exhibited notable advantages, there is an expansive horizon for further exploration:
(1) Integration with Other Explainability Frameworks: Beyond SHAP values, tools such as Local Interpretable Model-agnostic Explanations (LIME) can be incorporated to offer a richer tapestry of interpretations for cluster assignments.
(2) Real-time Interpretability: An exciting avenue for future work is enabling real-time elucidation of cluster assignments, particularly in scenarios where the dataset is fluid and ever-evolving.
(3) Enhanced Visualization Paradigms: For many stakeholders, visual insights delivered through the PyDash application are indispensable. Designing custom visualization paradigms, specifically tailored for SHAP values in a clustering context, can offer more immediate and discernible insights.
(4) Optimization and Scalability: As the magnitude of datasets expands, it is pivotal to ensure that our method scales gracefully. Delving into distributed computing and honing the efficiency of SHAP computation could be vital next steps.
In sum, the melding of explainable AI with self-label clustering represents a monumental stride in contemporary AI methodologies.It promises not just the demystification of clustering algorithms but also signals a brighter future for transparent and responsible machine learning practices.

4.1.2 Hierarchical Clustering: Hierarchical clustering is renowned for its flexibility in cluster determination. With the aid of a dendrogram, a tree-like diagram that illustrates the arrangement of clusters, one can decide on the optimal number of clusters.

4.1.3 PCA + Hierarchical Clustering and FAMD + Hierarchical Clustering: These approaches apply dimensionality reduction prior to executing hierarchical clustering. Both Principal Component Analysis (PCA) and Factor Analysis of Mixed Data (FAMD) are deployed to shortlist significant components.

4.1.4 K-Prototypes: Designed to accommodate datasets comprising mixed data types, both numerical and categorical, K-Prototypes is a hybrid clustering methodology. It extends the paradigms of K-Means and K-Modes, allowing for effective clustering of mixed datasets.
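Hedged sketches of two of these variants follow: PCA followed by agglomerative (hierarchical) clustering, and K-Prototypes on a mixed frame via the third-party kmodes package, which is assumed to be installed (FAMD would typically come from a package such as prince). The data, column names, and parameters are illustrative.

```python
# PCA + hierarchical clustering, and K-Prototypes on mixed data; the
# kmodes dependency and all parameters are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from kmodes.kprototypes import KPrototypes

X, _ = make_blobs(n_samples=200, centers=4, n_features=8, random_state=5)

# PCA + hierarchical: reduce to the leading components, then cluster.
Z = PCA(n_components=3).fit_transform(X)
hier_labels = AgglomerativeClustering(n_clusters=4).fit_predict(Z)

# K-Prototypes on mixed data: numeric columns plus one categorical column.
mixed = pd.DataFrame(X[:, :3], columns=["a", "b", "c"])
mixed["grade"] = np.random.default_rng(5).choice(["low", "high"], size=200)
kp = KPrototypes(n_clusters=4, random_state=5)
kp_labels = kp.fit_predict(mixed.to_numpy(), categorical=[3])
print(np.bincount(hier_labels), np.bincount(kp_labels))
```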

Algorithm 2 (excerpt), first-layer training within the stacking integration:
8: for each model m in M do
9:     Train model m on the training partition and make predictions P_m on the held-out partition
10: Choose a model in M for the second layer
11: Train that model on the first-layer predictions combined with the original features, and predict the final groups
12: ...

Table 2: Parameter configuration on modeling scale.
3.3.3 Dimension Reduction: Dimension reduction is crucial in tackling our high-dimensional dataset, where many features are correlated. We opt for sophisticated techniques over mere attribute removal to preserve crucial information. For numerical data, PCA identifies and distills components for maximum variance retention, used in later clustering. For mixed data frames, FAMD is employed, integrating principles of PCA and multiple correspondence analysis, yielding maximum-variance components. Furthermore, a variable-pruning step discards low-variance variables (in the 0-2% range), ensuring model efficiency.
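A small sketch of this prune-then-reduce flow follows; the synthetic data, the 0.02 variance threshold, and the 95% retained-variance setting are illustrative stand-ins for the values used in practice.

```python
# Prune near-constant variables first, then retain components for
# maximum variance with PCA; thresholds are illustrative.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
X[:, 5] = 0.01 * rng.normal(size=300)          # a near-constant variable

pruned = VarianceThreshold(threshold=0.02).fit_transform(X)
pca = PCA(n_components=0.95)                   # keep 95% of the variance
components = pca.fit_transform(pruned)
print("kept variables:", pruned.shape[1], "-> components:", components.shape[1])
```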