Embracing a Cluster Perspective: Optimal Matching Analysis Tutorial and Tool Demonstration

This paper presents a tutorial on Optimal Matching Analysis (OMA). OMA is a computational approach that compares sequences of states within and among observations in a sample. It has various advantages, from analyzing multiple waves of data as a whole unit of analysis to rigorously handling censored or truncated data. This paper introduces OMA through a 5-step tutorial on how to prepare data for, conduct, and apply the technique. The tutorial replicates parts of Setor and Joseph's paper to illustrate the application and interpretation of OMA using updated cohort data from the National Longitudinal Survey of Youth 1997 (NLSY97). OMA applied to the NLSY97 reveals 6 clusters of residential patterns, described by demographic (e.g., sex and ethnicity) and human capital (e.g., education and income) characteristics. The tutorial also covers the analysis of multiple sets of sequences (e.g., residential and career). Overall, this paper hopes to encourage the use of OMA as a viable approach in IS research and to equip researchers with an understanding of how OMA works. The approach would be useful for researchers, especially those who routinely track and work with repository data and those who intend to look at their data in a different light.


INTRODUCTION
Studying change in IS research is vital for understanding the growth or evolution of individuals and institutions. Only by delving into the dynamic processes of change can we build richer explanations of the phenomena we study. Studying change involves examining changes over time in micro-level factors (e.g., personal agency), macro-level factors (e.g., institutional factors), or both. Cross-sectional analyses, although useful, risk painting a static picture, and attempts to link separate parts of that picture together may be rendered irrelevant in the face of rapid change. At the same time, Markovian processes, a step-by-step approach concerned with the probability of specific state changes, also stand to benefit from a complementary broad-based analytical perspective.
Few studies in the behavioral information systems domain have examined patterns or sequences of behaviors. Some of these works analyze organization characteristics together with information-sourcing strategies [4]. Other studies have looked at workers' adoption of IT innovations [9], along with routine and digital capabilities in organizations [8]. Meanwhile, other work has analyzed users' searching behaviors using apps [5].
Optimal matching analysis (OMA) is an approach that computationally compares changes of states among observations [1]. With origins in genetic sequencing research and popularity in life-course and career research, it is a non-parametric approach. OMA uses computational methods to plot, visualize, and identify sequential patterns in data. While other sequence analysis approaches have their merits in analyzing historical patterns and in applications more suited to genome research, OMA is the social science counterpart suited for analyzing behaviors that occur in sequences. It uses pairwise alignment, generally the simplest and least computationally expensive approach, to compare two sequences and quantify their dissimilarity. OMA has numerous features that complement conventional cross-sectional and Markovian approaches. These features include OMA's capability to analyze multiple waves of data as a whole unit of analysis, where state elements may or may not change over time ([2], [11]). It can also deal with the censored or truncated data that is common in longitudinal research, a situation where data typically becomes available or unavailable over time.
This paper provides an overview and tutorial on OMA, covering the basic technique. The tutorial uses data from the publicly available National Longitudinal Survey of Youth 1997 (NLSY97). The paper first provides a conceptual overview of OMA. Second, it provides a step-by-step approach to preparing the data, with details on constructing single and multiple dimensions (multidomain). Third, it covers the analysis of two sets of sequences, residential and career paths. Finally, the paper concludes with a discussion of the implications of OMA for theory and empirical analysis. This paper contributes by encouraging researchers to consider OMA as a practical approach to investigating patterns.
Optimal matching analysis is a computational approach that calculates the extent of dissimilarity among sequences of states generated by observations in a sample. A sequence represents a chain of categorical states. For example, a simple sequence from 5-wave data may look like R-R-R-U-U, where R represents a case living in a rural area and U a case living in an urban area. Should each interval represent a year, the sample sequence means the participant lived in a rural residence for the first three years, followed by the remaining two years in an urban place.
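As a minimal sketch of how such a sequence can be read, the Python snippet below collapses the R-R-R-U-U example into spells of consecutive identical states. The function name `spells` is hypothetical and is not part of the original analysis, which was run in R.

```python
from itertools import groupby

def spells(sequence: str, sep: str = "-") -> list[tuple[str, int]]:
    """Collapse a state sequence into (state, duration) spells."""
    states = sequence.split(sep)
    # groupby yields one run per block of consecutive identical states
    return [(state, len(list(run))) for state, run in groupby(states)]

print(spells("R-R-R-U-U"))  # [('R', 3), ('U', 2)]
```

Read with yearly intervals, the output corresponds to three years in a rural residence followed by two years in an urban one.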
The resulting analytical output from OMA is a 2-dimensional matrix. For example, a simple dataset of 5 sequences (N = 5) will result in a 5 by 5 matrix. The matrix presents the distances between all pairs of sequences in the dataset [13]. With our simple dataset, there will be 10 pairwise comparisons [N(N-1)/2]. To calculate the pairwise distances, a matrix of the 'cost' of substitution and indel (i.e., insertion and deletion) first needs to be calculated for all possible pairs of the different states: rural, suburban, and urban residence. The 'costs' are computed based on the occurrences of pairwise state changes in the data. The most common changes tend to have the lowest 'cost', while the least common changes tend to have the highest 'cost'. Thereafter, pairwise distances are derived from the computed matrix of 'costs', producing a distance matrix. This output then serves as input for cluster analysis, which organizes the data into meaningful groups. Setor and Joseph adopted a hierarchical clustering algorithm to surface natural groups in the data and a non-hierarchical clustering algorithm to validate the result of the hierarchical clustering algorithm [12]. The clusters are visualized using a distribution plot to highlight patterns, which may be used to label each cluster. The clusters may then be used in subsequent analyses.
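The following Python sketch illustrates the two quantities described above: the number of pairwise comparisons and data-driven substitution costs. It assumes a common transition-rate rule, cost(a, b) = 2 - p(a to b) - p(b to a), so that frequent changes are cheap and rare changes are expensive; whether the analysis used exactly this rule is an assumption, and the toy sequences and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def n_pairs(n: int) -> int:
    """Number of pairwise comparisons among n sequences: N(N-1)/2."""
    return n * (n - 1) // 2

def transition_rates(sequences):
    """p(b | a): share of transitions leaving state a that arrive in state b."""
    moves, leaving = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            moves[(a, b)] += 1
            leaving[a] += 1
    return {pair: count / leaving[pair[0]] for pair, count in moves.items()}

def substitution_costs(sequences, states):
    """Assumed rule: cost(a, b) = 2 - p(a -> b) - p(b -> a), symmetric in a and b."""
    rates = transition_rates(sequences)
    costs = {}
    for a, b in combinations(states, 2):
        cost = 2.0 - rates.get((a, b), 0.0) - rates.get((b, a), 0.0)
        costs[(a, b)] = costs[(b, a)] = cost
    return costs

# Hypothetical toy data: R = rural, SU = suburban, U = urban
toy = [
    ["R", "R", "SU", "SU", "U"],
    ["R", "SU", "SU", "U", "U"],
    ["R", "R", "R", "U", "U"],
]
costs = substitution_costs(toy, ["R", "SU", "U"])
print(n_pairs(5))  # 10
print(costs[("R", "SU")] < costs[("R", "U")])  # True: R to SU is more common here
```

In this toy data, moves from rural to suburban occur twice as often as moves from rural to urban, so the rural-suburban substitution receives the lower cost.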

OPTIMAL MATCHING ANALYSIS TUTORIAL
Figure 1 presents the 5 major steps that Setor and Joseph used in their OMA [12]. With new data from the NLSY97, this paper aims to surface the residential patterns and validate these patterns with demographic (e.g., sex and ethnicity) and human capital (e.g., education and income) data. This paper also explores the prototypical career paths of participants who reside within and across different geographic locations.

Extracting and Filtering NLSY97 Data
NLSY97 data is publicly available and is commissioned by the US Bureau of Labor Statistics (BLS). It is a longitudinal survey of youths conducted by the Center for Human Resource Research at Ohio State University [14]. The survey follows the life course of Americans born between 1980 and 1984. The data consists of 8,984 respondents who were between 12 and 17 years old at their first interview. Respondents were surveyed annually from 1997 and every two years from 2011 onwards. This paper used the available data, consisting of 19 rounds from 1997 to 2019.
NLSY97 data is appropriate for this purpose because it contains comprehensive information, such as demographic, career and residential data of respondents surveyed annually. The data allows us to track respondents' careers and geographic residence statuses to carry out the research objective. We extracted the relevant data for the residence profile. We obtained the participants' self-reported residence location (0 = Rural, 1 = Urban) and Metropolitan Statistical Area (MSA) data (1 = not in MSA; 2 = in MSA, not in central city; 3 = in MSA, in central city; 4 = in MSA, not known; 5 = not in country). For career data, we extracted the job where respondents worked the longest period in a given year, the number of hours worked, as well as demographic data such as gender, ethnicity, annual income, and education.
Following Setor and Joseph, the data were filtered according to the following criteria: respondents (1) were at least 18 years old and had attained a high-school diploma, (2) held a full-time civilian job, and (3) had provided at least 13 years of residence and workforce data out of the 19 waves of data collection [12]. The first and second criteria guard against skewed data. This is because participants under the age of 18 are only able to hold a limited range of occupations under US law, compared to those who are 18 and above, who can hold almost all occupations.
While the 1st criterion can be filtered based on self-reported age and education status, the 2nd criterion was filtered based on the job where respondents worked the longest period in a given year. Job status was based on the hours worked, with reference to the job code status in [15]. These job codes were as follows: (i) full-time paid job (37 hours or more per week), (ii) part-time paid job (19-36 hours per week), (iii) part-time paid job (1-18 hours per week), and (iv) no work/unemployment/other.
The last criterion, that respondents provide at least 13 years of residence and workforce data, was a methodological consideration. We ensured that no case sequence had more than 30% of the job profile missing. The check ensures that no case has too much missing data, because OMA produces inconsistent estimates in that situation [6]. The threshold of 30% is based on recommended best practice derived from Monte Carlo simulations. In other words, for the 19-wave data, cases whose career and residence sequences had fewer than 13 observed states were filtered out. We conducted this filtering procedure after coding and constructing the various sequences. The number of respondents excluded by the first criterion is 2,098, and by the second criterion is 404.
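The completeness check described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: missing waves are assumed to be represented as `None`, and the case identifiers and function names are hypothetical.

```python
MIN_OBSERVED = 13  # threshold stated in the text for the 19-wave data

def observed_count(sequence):
    """Count the waves in which a state was actually recorded."""
    return sum(state is not None for state in sequence)

def filter_cases(cases, min_observed=MIN_OBSERVED):
    """Keep only cases with at least `min_observed` recorded states."""
    return {
        case_id: seq
        for case_id, seq in cases.items()
        if observed_count(seq) >= min_observed
    }

# Hypothetical 19-wave cases: "A" has 13 observed states, "B" has 12
cases = {
    "A": ["U"] * 13 + [None] * 6,
    "B": ["R"] * 12 + [None] * 7,
}
kept = filter_cases(cases)
print(sorted(kept))  # ['A']
```

For a case to be retained in the tutorial's setting, the same check would need to pass for both its residence sequence and its career sequence.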

Step 1: Coding Single and Multiple Domains. After extracting the data, the first step is to code the data. We defined codes representing a residential and career status in a particular year [10]. Coding schemes are generally derived from a theoretical framework. Data used for sequence analysis can typically take two forms: single or multiple domains [11]. The single domain represents participants' residential profiles (rural, urban and suburban) as determined by the US Census Bureau, shown in Table 1 [16]. Residential profiles were coded with the [16] classification using Metropolitan Statistical Area (MSA) data; the MSA value 5 (not in country) and the non-response codes ((-1) refusal, (-2) don't know, (-3) invalid skip, (-4) valid skip, (-5) non-interview) were coded as NA. Each point in a sequence can also be represented by multiple domains, which we illustrate by coding the participants' career status. Following the coding procedure of [12] in Figure 2, three domains were similarly coded: (i) the entrepreneurial, (ii) the professional, and (iii) the leadership dimensions.
The entrepreneurial dimension was based on self-employed status (self-employed = 'E', all other = '0'), the professional dimension was based on education qualifications held (associate degree or higher = professional 'P', lower = 'V'), and the leadership dimension was based on the Occupational Classification System of the US Bureau of Labor Statistics (managerial = 'L', non-managerial = '0') [15]. These dimensions produced 8 distinct job profiles: V, EV, P, VL, EVL, PL, EP and EPL.
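The multidomain coding above can be sketched as a small Python function that concatenates the three dimension codes and drops the '0' placeholders. The function and argument names are hypothetical; only the letter codes come from the text.

```python
from itertools import product

def job_profile(self_employed: bool, has_degree: bool, managerial: bool) -> str:
    """Combine the three career dimensions into one multidomain state label."""
    entrepreneurial = "E" if self_employed else ""  # '0' (all other) is dropped
    professional = "P" if has_degree else "V"       # associate degree or higher = 'P'
    leadership = "L" if managerial else ""          # '0' (non-managerial) is dropped
    return entrepreneurial + professional + leadership

# Enumerate every combination of the three binary dimensions
profiles = {job_profile(e, p, l) for e, p, l in product([False, True], repeat=3)}
print(sorted(profiles))
```

Enumerating all combinations yields exactly the eight job profiles listed in the text.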

Step 2: Constructing Residential and Job Profile Sequence. After coding both residential and job profiles, we collated both profiles across all 19 waves of data from 1997 to 2019 into a table. Figure 3 shows an extract of one case's (ID 1410) career profile taken across 19 waves being merged into a single table. Each row represents a case, and each column represents the state for a wave (i.e., year). After coding the single-domain residential and multidomain career profiles, we applied the third criterion. We ensured that every case provided at least 13 years of residence and workforce data by the 19th wave of data collection and filtered out the rest using the count function. The number of respondents excluded by the 3rd criterion is 5,835.

Step 3: Importing and Converting Table to Sequence Object and Dealing with Missing Data. After constructing the career and residence profiles in a single table, the table was imported into R to run sequence analysis using the TraMineR package. TraMineR provides a host of functions to convert and run sequence analysis, with in-depth detail covered in the R manual [7]. We first conduct a sequence analysis on the residential sequence. In this step, the residential sequence was converted into a sequence object. Figure 4 shows a sample of how 2 observations are converted into sequence objects. The table of path sequences was converted into sequences of time states stored as a sequence object. All NA values on the left-hand side of each path sequence were removed so that all cases, despite being in different cohorts (i.e., where data become available from a certain year onwards), were adjusted to start from their first time state.
For example, information on the residential status (e.g., urban) of 'ID 48' starts from the year 2000 onwards, while the residential status of 'ID 1410' is available from 1998. When converting the path sequences into a sequence object, 'ID 48' and 'ID 1410' were flushed left accordingly, with the first time state of 'ID 48' being urban (U) and that of 'ID 1410' being suburban (SU). At the same time, NA values in the middle and at the right-hand side of the sequence were converted into an asterisk (*), indicating missing data that needs to be accounted for. Overall, this step highlights OMA's ability to rigorously handle censored and truncated data.
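The left-alignment and missing-data marking described above can be illustrated with a short Python sketch. It mimics the described behavior only and is not the TraMineR implementation; missing waves are assumed to be `None`, and the example row is hypothetical rather than the actual record of any respondent.

```python
def left_align(row, missing="*"):
    """Drop leading missing waves, then mark the remaining gaps with `missing`."""
    first = next((i for i, state in enumerate(row) if state is not None), None)
    if first is None:
        return []  # the case has no observed state at all
    return [missing if state is None else state for state in row[first:]]

# Hypothetical case whose data only become available from the fourth wave
row = [None, None, None, "U", None, "U", None]
print(left_align(row))  # ['U', '*', 'U', '*']
```

The leading gaps disappear so that the sequence starts at its first observed state, while gaps in the middle and on the right are kept as explicit missing markers.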

DATA ANALYSIS

Step 4: Calculating Dissimilarity of Residence Sequences
In this step, we determined the 'cost' of indel (insertion and deletion) and substitution for all possible pairs of the rural, suburban and urban residential states before performing the OMA. The indel cost is set to 1 in this example. For the substitution costs, a matrix of costs was computed for all possible conversions of one state into another by looking at the occurrences in the data, as shown in Figure 5. In other words, should the data generally have more observations migrating from Rural to Suburban than from Rural to Urban, the cost will be lower for the former (1.81) than for the latter (1.96). For illustration purposes, the cost of conversion between NA and the residential profiles remains at 0. Substitution costs are generally about twice the indel cost. Researchers are free to adjust the costs based on conceptual justification, with in-depth discussion covered in [11].
To illustrate how OMA calculates the dissimilarity among the sequences of cases, consider cases 'ID X' and 'ID Y' in Figure 5. Converting the state path of 'X' to look exactly like that of 'Y' costs the same as converting 'Y' to 'X', so the resulting distance matrix is symmetric. Looking at both cases, time states 1, 2 and 5 have the same state, but states 3 and 4 differ, which can be resolved with the two cheapest options. The first involves removing the urban (U) profile of 'Y' in time state 3 and adding a rural (R) profile after state 5, which shifts the alignment to the left, for a total cost of 2 (two indels at a cost of 1 each). The second option is to substitute suburban (SU) in time state 3 of 'X' for urban (U) and, in time state 4, to substitute rural (R) for suburban (SU), for a total cost of 3.76. Given that OMA takes the minimum cost as the distance, it registers option 1, with the lower cost of 2, in the distance matrix output. These output values are then fed as inputs to a cluster analysis technique to identify different clusters.
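The worked example can be reproduced with a minimal dynamic-programming sketch of the optimal matching distance in Python. Several values are assumptions made for illustration: the states at time points 1 and 2 are not given in the text and are set to urban here, the state at time point 5 is taken to be rural (as implied by option 1), and the suburban-urban cost of 1.95 is inferred by subtracting 1.81 from the reported total of 3.76.

```python
INDEL = 1.0
SUBSTITUTION = {
    frozenset(("R", "SU")): 1.81,  # reported in the text
    frozenset(("R", "U")): 1.96,   # reported in the text
    frozenset(("SU", "U")): 1.95,  # assumption: inferred as 3.76 - 1.81
}

def sub_cost(a, b):
    """Substitution cost between two states (0 when they are identical)."""
    return 0.0 if a == b else SUBSTITUTION[frozenset((a, b))]

def om_distance(x, y, indel=INDEL):
    """Minimum total cost of indels and substitutions that turns x into y."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel
    for j in range(1, cols):
        d[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel,                             # delete from x
                d[i][j - 1] + indel,                             # insert into x
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),  # substitute or match
            )
    return d[-1][-1]

# Hypothetical reconstruction of 'ID X' and 'ID Y' (time states 1 and 2 are assumed)
x = ["U", "U", "SU", "R", "R"]
y = ["U", "U", "U", "SU", "R"]
print(om_distance(x, y))  # 2.0, the indel-based option 1
```

Raising the indel cost (for example, `om_distance(x, y, indel=10.0)`) makes the two substitutions of option 2 the cheaper path, which shows how the choice of costs shapes the resulting distance.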

Step 5: Cluster Analysis of Residence Sequences
We used the distance matrix output to conduct cluster analysis. A hierarchical clustering algorithm was run in R and validated with a non-hierarchical clustering algorithm in SPSS. For the hierarchical clustering in R, we used Ward's clustering algorithm, an iterative agglomerative process that starts with each observation as its own cluster and, at each step, merges the pair of clusters whose union gives the smallest increase in total within-cluster variance. The algorithm creates "natural groups" based on the distance matrix that was computed from the OMA. We instructed the algorithm to produce solutions of up to 10 clusters (the default in R) and compared the AIC (Akaike's Information Criterion) fit statistic (i.e., AIC values) of the 10 solutions to determine the optimal cluster number [3]. The optimal number of clusters is the one where the similarity between observations within each cluster and the dissimilarity between clusters are optimal [13].
The optimal cluster number that best reflects the data is the one where the AIC value is at its lowest point. Our analysis found that the optimal number of clusters is between 6 and 7. Thereafter, we used a non-hierarchical clustering algorithm, specifically k-means, to validate the result of the hierarchical clustering and to test whether 6 or 7 clusters is more appropriate. Subsequently, we used SPSS to compute Cohen's kappa to compare the cluster assignments of the hierarchical and non-hierarchical clustering algorithms. The solution with the relatively higher and significant level of agreement was taken as the more robust one, which was the 6-cluster solution (Cohen's kappa = 0.21, p < .05). Thereafter, we used a modal plot to visualize the different clusters of residential patterns.
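The agreement statistic used for validation can be sketched in Python as below. This illustrates the standard Cohen's kappa formula, (observed agreement - expected agreement) / (1 - expected agreement), and is not the SPSS procedure itself. It assumes that the cluster labels from the two algorithms have already been matched to each other, since cluster numbering is arbitrary; the label vectors are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two label assignments of the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the marginal label frequencies
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assignments use one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical cluster memberships for four cases
ward_labels = [1, 1, 2, 2]
kmeans_labels = [1, 1, 2, 1]
print(cohen_kappa(ward_labels, kmeans_labels))  # 0.5
```

Computing the statistic once for the 6-cluster solutions and once for the 7-cluster solutions would allow the two candidate cluster numbers to be compared on their level of agreement.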

RESULTS AND ANALYSIS

Descriptive Results
In Figure 6, the OMA and clustering techniques show six meaningful residential pattern clusters. We empirically label these clusters as "Urban", "Suburban-Urban", "Suburban", "Suburban-Urban-Suburban", "Urban-Suburban", and "Rural". Our empirical labels are supported by the proportions of the different residences (Urban, Suburban and Rural) in each cluster, as shown in Table 2.
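The proportions that support the cluster labels can be computed along the lines of the Python sketch below. The sequences and cluster memberships are hypothetical toy values, the function name is an assumption, and missing states marked with an asterisk are assumed to be excluded from the proportions.

```python
from collections import Counter, defaultdict

def state_shares(sequences, cluster_labels, missing="*"):
    """Share of each observed state within each cluster, ignoring missing states."""
    counts = defaultdict(Counter)
    for seq, label in zip(sequences, cluster_labels):
        counts[label].update(state for state in seq if state != missing)
    shares = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        shares[label] = {state: n / total for state, n in counter.items()}
    return shares

# Hypothetical sequences with their assigned clusters
sequences = [
    ["U", "U", "U", "SU"],
    ["U", "U", "*", "U"],
    ["R", "R", "R", "R"],
]
labels = ["Urban", "Urban", "Rural"]
shares = state_shares(sequences, labels)
print(round(shares["Urban"]["U"], 2))  # 0.86
```

A cluster dominated by one state, as in this toy "Urban" cluster, would justify a single-state label, while clusters with mixed shares call for inspecting the order of states over time.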

Further Analysis Techniques: Single-Channel Analysis
After identifying the clusters, single-channel analysis can be conducted with a range of outcome variables of interest. In this example, we analyzed the clusters of residential patterns against several demographic and human capital outcomes: gender, ethnicity, education, and average annual income. Gender, ethnicity, and education were analyzed using chi-square tests, and annual income was analyzed using ANOVA, with results shown in Table 3. Unlike the NLSY79 data, where respondents reported how much they earned annually as ratio data, the annual income in the NLSY97 data was reported as ordinal data (1 = $1-$5,000, 2 = $5,001-$10,000, 3 = $10,001-$25,000, 4 = $25,001-$50,000, 5 = $50,001-$100,000, 6 = $100,001-$250,000, 7 = more than $250,000).
A second sequence analysis of the career profile was also performed within each cluster of residential patterns, following the same 5 steps. Figure 7 highlights six meaningful clusters of job profile patterns within the "Urban-Suburban" cluster. We empirically label these clusters as "Mid-self-employed vocation career", "Persistent vocational career", "Mid-Late vocational career", "Persistent professional", "Late-self-employed vocation career", and "Late-Leadership Professional". Similarly, these labels are supported by the proportions of the different job profiles (EV: entrepreneurial and vocation; EP: entrepreneurial and professional; VL: vocation and leadership; P: professional; PL: professional and leadership; and V: vocation), as shown in Table 4.
Note to Table 3: The table shows the demographic breakdown of the six residential clusters by sample size (N), gender, ethnicity, education, and income. Scores were computed using chi-square tests and univariate analysis of variance (ANOVA). *p < .05, **p < .01, ***p < .001.
Note to Table 4: The table shows the breakdown of the six career clusters by job profile (EV, EP, VL, P, PL, and V).

DISCUSSION AND CONCLUSION
In conclusion, this paper provides an overview and tutorial on Optimal Matching Analysis with archival data from the NLSY97. The paper has outlined a conceptual overview and proceeded through the 5 steps to prepare and conduct OMA with hierarchical and non-hierarchical clustering techniques. Thereafter, it shows the results from OMA, revealing 6 residential clusters. It also conducts a sequence analysis within the "Urban-Suburban" cluster, revealing 6 clusters of career path patterns. Besides the residential and career profile patterns shown in this paper, cluster patterns can also be formed from attitudes and behaviors, offering a plethora of potential insights. With further analysis, such clusters can serve as predictor variables (i.e., what do these clusters of sequences determine?) or as outcome variables (i.e., what antecedents determine these clusters of sequences?). It is important to note that OMA provides rich descriptions but should not be interpreted as showing that each state causes the next state. It prepares the data for in-depth analysis, often using ANOVA and/or chi-square tests.
The use of OMA has implications for theory and empirical analysis. For theory, OMA uncovers the dynamic processes in the data, which has the potential to enrich theories. For example, OMA identified 3 more meaningful archetypes of residential path patterns (i.e., "Suburban-Urban", "Suburban-Urban-Suburban", and "Urban-Suburban") in addition to the expected residential path patterns of "Urban", "Rural", and "Suburban". Furthermore, OMA can uncover nuances in the movement of individuals over time. For example, the "Suburban-Urban-Suburban" cluster shows a mix of urban, suburban and rural from year 1 to 6 (Y1-Y6) that stabilizes to a suburban profile from year 7 to 14 (Y7-Y14), and thereafter to an urban profile for the remaining years. This example shows that OMA can shed light on the granular processes of attitudes and behaviors.
In terms of implications for empirical analysis, OMA complements longitudinal analyses in uncovering phenomena that do not rely on correlational techniques. As a non-parametric analytical approach, OMA does not assume a specific distribution of the data, thereby allowing several types of data to be analyzed, especially categorical or ordinal data. The algorithms are robust to the influence of outliers. It can also deal with censored or truncated data, which is common in longitudinal research.

Figure 3: Combining various tables of "ID 1410" residential profiles from 1997 to 2019 into a single table.

Figure 4: Table converted into sequence object in the process of running OMA.

Figure 5: An illustration of how OMA compares the sequences of states among observations.

Figure 6: The OMA and clustering techniques revealed 6 meaningful clusters of residential patterns.

Figure 7: Six meaningful clusters of career patterns revealed within the "Urban-Suburban" residential cluster.

Table 1: Single Domain Residential Profile Coding

Table 2: Proportion of the six residential clusters. The table shows the six residential clusters' breakdown of the rural, urban and suburban geographic residences.

Table 3: Summary of the six residential clusters' demographic and human capital characteristics

Table 4: Proportion of job profiles in the six career clusters