Video-based Contrastive Learning on Decision Trees: from Action Recognition to Autism Diagnosis

How can we teach a computer to recognize 10,000 different actions? Deep learning has evolved from supervised and unsupervised to self-supervised approaches. In this paper, we present a new contrastive learning-based framework for decision tree-based classification of actions, including human-human interactions (HHI) and human-object interactions (HOI). The key idea is to translate the original multi-class action recognition into a series of binary classification tasks on a pre-constructed decision tree. Under the new framework of contrastive learning, we present the design of an interaction adjacency matrix (IAM) with skeleton graphs as the backbone for modeling various action-related attributes such as periodicity and symmetry. Through the construction of various pretext tasks, we obtain a series of binary classification nodes on the decision tree that can be combined to support higher-level recognition tasks. Experimental justification for the potential of our approach in real-world applications ranges from interaction recognition to symmetry detection. In particular, we have demonstrated the promising performance of video-based autism spectrum disorder (ASD) diagnosis on the CalTech interview video database.


INTRODUCTION
Rapid advances in smartphone technology have greatly facilitated the acquisition and sharing of short video clips through various social media platforms such as TikTok and Instagram. Video-based action recognition [9,15,23,65,79] has been extensively studied in the literature. In contrast, the recognition of human interaction based on video [57] has remained an under-researched topic. One of the emerging applications for analyzing human interaction from short video clips is behavioral imaging [47], which aims at diagnosing behavioral disorders (e.g., autism and depression) from short video clips in natural environments. In particular, the online diagnosis of autism spectrum disorder (ASD) at home [1] has received increasing attention from both the technical and clinical communities. During the COVID pandemic, the conventional gold standard, the Autism Diagnostic Observation Schedule (ADOS) interview [38], became unsafe to practice due to social distancing regulations. Instead, behavioral imaging of short video clips acquired by smartphones has become a safer and more cost-effective alternative to ADOS interviews.
From the multimedia perspective, the key challenge in action recognition lies in the complexity of human movements and social interactions. Similarly, the diagnosis of ASD from behavioral imaging also faces challenges in the automatic analysis of social affection (SA) and restricted and repetitive behavior (RRB) from smartphone video. Both the human-human interaction (HHI) [43] and the human-object interaction (HOI) [74] are important to the task of behavioral imaging: the former reflects the subject's social interaction skills, and the latter is correlated with the person's motor as well as joint attention skills. In behavioral imaging, abnormal behaviors in HHI and HOI often carry significant information for the diagnosis of autism, e.g., lack of response to name calling [7] and sticking to the same repeating pattern during toy play [63]. Other abnormal behaviors that do not involve interaction, such as avoiding eye contact during conversation [24] and self-stimulating behavior, called stimming [46], are also hallmark signals for autistic children.
Although both HHI and HOI have been studied for action recognition by computer vision and multimedia communities, irregularity detection [4] or anomaly detection [17] from video clips has several unique challenges. In addition to well-known barriers (e.g., cluttered background, interference from occlusion, and invariance to viewpoint [27]), it is often difficult for even human experts to tell the subtle differences between regular hand flapping and stimming, or between toy playing by a typically developing (TD) child and the sameness (sticking to the same toy, a typical repetitive behavior) shown by autistic children. For example, untrained viewers of an hour-long ADOS interview video often fail to distinguish between TD and ASD. It takes special training for health professionals, such as developmental pediatricians and behavioral therapists, to become ADOS-reliable. Teaching a computer to reliably identify irregularities (e.g., abnormal behavior in autistic patients) from video has remained a long-standing open problem in computer vision [17].
Inspired by the long-lasting impact of classification and regression trees (CART) [6], we propose to take a binary decision tree approach to address both action recognition and diagnosis of ASD in this work. New insights are borrowed from human motion analysis [3] and autism research [38] to focus on a series of basic binary classification tasks as the building block of the complete vision system. For action recognition, we have considered mutual interaction (absent vs. present), time reversibility (yes vs. no), periodicity (yes vs. no) and body movement (upper vs. lower). For the diagnosis of ASD, we have focused on modeling two types of irregularities associated with autistic behavior: breaking social interaction [68] (e.g., lack of social interaction and imitation) and sticking to sameness [18] (e.g., repetitive actions and self-stimulating behaviors). The shared motivation behind our approach, which builds on naive Bayesian decision trees [75], is to offer a unified solution to both action recognition and ASD diagnosis, i.e., for a given video clip, our objective is to produce a sequence of binary decisions constructed from domain knowledge. By fusing those binary decisions adaptively, we expect to improve the reliability and accuracy of video-based action recognition or diagnosis of ASD.
To support inference in the decision tree, we propose to construct an interaction adjacency matrix (IAM) on skeleton graphs [71] for representation learning and develop self-supervised contrastive learning [33] for binary classification. Our IAM representation can be interpreted as an extension of the graph convolution network (GCN) [37] by incorporating HHI into the adjacency matrix. Our IAM is conceptually similar to the recent pairwise adjacency matrix (PAM) [72] and interaction relational network (IRN) [43] for HHI detection. However, PAM lacks interaction information from one part of a person to a different part of another person, as it only focuses on identical joint types or the center of gravity interaction between two individuals. This limitation is particularly noticeable in asymmetric mutual actions, such as when one person (dominant) punches another person's face (subordinate). In such a case, the dominant person's hand should exhibit a strong intensity towards the subordinate's face in the reconstructed interaction graph. Our IAM overcomes this limitation and, unlike IRN, eliminates the need for additional computing resources for a relation network [52] to compute the relationships between joint pairs. Moreover, our IAM contains a novel extension, i.e., the interaction between the left and right hands of a single person. Meanwhile, the new insight brought about by knowledge of the autism domain allows us to learn the concept of symmetry in interaction and imitation as a behavioral marker, i.e., autistic people often demonstrate broken symmetry in both HHI (e.g., lack of response to name-calling [36]) and HOI imitation (e.g., pretending to brush the teeth during the ADOS interview [59]). Our technical contributions can be summarized as follows.
• Formulation of the problem of video-based action recognition and diagnosis of ASD via decision trees. We aim to extract a series of basic units, such as periodicity and dominance, for action recognition and behavioral biomarkers (symmetry and sameness) to facilitate the task of detecting ASD.
• Construction of a trainable ST-GCN for representation learning. We propose to learn an IAM for modeling the interaction between two people and two hands of a single person. Instead of using a learnable mask [71], we have developed a different IAM as a warm-up regularization strategy.
• Development of a self-supervised contrastive learning algorithm for attribute/anomaly detection. We have built on previous work [33] to tackle the problem of periodicity/dominance/symmetry/sameness detection by constructing pretext tasks. By combining ST-GCN representation with contrastive learning, we managed to achieve a reliable and transparent extraction of features from the action video and biomarkers from the ADOS video. The excellent explainability of our approach is desirable for bridging between computing and healthcare professionals.
• Experimental results are reported to demonstrate the performance of our approach in video-based action recognition and diagnosis of ASD. On action recognition, we have achieved a noticeable improvement over previous ST-GCN on several testing scenarios. Using symmetry and similarity as biomarkers, respectively, we have achieved a precision above 80% for the task of diagnosing ASD.

RELATED WORK
Video-based Action Recognition. Previous work on action recognition is based on 3D convolutional neural networks [23] or two-stream convolutional networks [58]. The experiments are often conducted on two popular datasets: HMDB-51 [28] and UCF-101 [60]. In view of the paucity of videos in these datasets, a new Kinetics Human Action Video dataset [25] was used in [9] along with a new two-stream inflatable 3D ConvNet (I3D) architecture. More recently, the construction of larger datasets for action recognition, such as NTU RGB+D [54] and NTU RGB+D 120 [35], has inspired a flurry of more powerful architectures (e.g., graph convolutional network [12,30,71], transformer [31,44,77]) for action recognition. The accuracy of current state-of-the-art skeleton-based action recognition has exceeded 76% and 91% in Kinetics-400 and NTU RGB+D, respectively. There is also a recent study on group activity recognition [69], which emphasizes the group instead of individual actions, as well as symmetry modeling for action evaluation [16] through an asymmetric interaction module. For a recent survey on video action recognition, see [79]. The latest advances in this field include recurring the transformer [73], dual-head contrastive domain adaptation [13], Direcformer [64], and learning from the temporal gradient [70].
Video-based ASD Classification. In the literature on autism research, video-based autism diagnosis dates back at least to [41], where the correspondence between home video and parent report was shown as an onset pattern in autism. Early works on behavioral imaging for autism screening [47] focused on abnormal eye contact or visual scanning of faces [42]. Eye-tracking data for visual attention have been used to train a deep neural network (DNN) for identifying people with ASD in [24]. The idea of ASD screening using machine learning on home video has also been explored in [1] and [61]; however, the extraction and scoring of 30 diagnostic features (e.g., eye contact, joint attention, and imitating actions) was still done by humans in [61] due to the technical challenge with automatic video analysis. Most recently, the gaze pattern extracted from the video has been utilized for early ASD diagnosis of young children in [10]. Generally speaking, there is still a significant gap between human-based and machine-based analysis of ADOS interview video [38] for autism diagnosis. When compared with standard HHI and HOI research [20,39,43,66,72,78], video-based ASD classification is lacking in both datasets and pre-trained models. Only recently has video-based ASD diagnosis been studied for remote telehealth assessment in [14] and with few-shot learning constraints in [76].

THE PROPOSED METHOD

Problem Formulation and Preliminaries
The architecture of our backbone: spatiotemporal graph convolutional network (ST-GCN) [71].
In the literature, video-based action recognition refers to the problem of labeling video sequences with action labels [45]. Unlike multi-label classification [21], we opt to formulate the video-based action recognition and diagnosis of ASD as a binary classification problem in decision trees (as shown in Figure 2 and Figure 3) for the sake of explicitly exploiting the domain knowledge related to SA and RRB. For a given video clip, our aim is to generate a sequence of binary scores as the output of the decision tree. This binary score vector can be used to train various machine learning models (e.g., support vector machine and random forest [5]) for the action recognition or ASD diagnosis task. With the introduction of decision trees, we start with the formulation of a binary classification problem as the building block of our recognition/diagnosis system.
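As a minimal sketch of the output format described above, the snippet below thresholds per-node scores into a binary score vector. The node names, the 0.5 threshold, and the helper `to_binary_scores` are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical node names, loosely following the attributes listed in the introduction.
NODES: List[str] = ["mutual", "reversible", "periodic", "upper_body"]


def to_binary_scores(node_probs: Dict[str, float], threshold: float = 0.5) -> List[int]:
    """Threshold each node's probability into a 0/1 decision, in a fixed node order."""
    return [1 if node_probs[name] >= threshold else 0 for name in NODES]


# Made-up probabilities for one video clip.
probs = {"mutual": 0.91, "reversible": 0.12, "periodic": 0.40, "upper_body": 0.77}
scores = to_binary_scores(probs)
print(scores)
```

A vector of this kind is what a downstream model such as a support vector machine or random forest would consume; the fixed node order keeps the feature layout stable across clips.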
Problem Formulation. Given a segment of an action or ADOS interview video (assuming that it is parsed into an individual interview task after preprocessing), how to determine its binary action-related or ASD-related attribute?
To tackle the above problem, we will start with a brief review of the existing work on skeleton graph representation that has been widely studied for action recognition. A skeleton graph for action recognition is an abstraction of 17 or 25 joints based on human anatomy. The current state-of-the-art in skeleton graph-based action recognition has adopted a spatiotemporal graph convolutional network (ST-GCN) [26] architecture. However, datasets for action recognition, such as NTU-60 and NTU-120, only contain a small collection of actions from our daily lives and are only appropriate for supervised learning. We need to extend both data representation and network design for autism diagnosis by incorporating domain-specific knowledge.
Skeleton Graph and ST-GCN. Skeleton-based action recognition has received increasing attention thanks to the rapid advance of 3D pose estimation [62] and graph convolutional networks (GCN) [26]. In this paper, we choose spatial-temporal graph convolutional networks (ST-GCN) [71] as the baseline because it has been widely studied in the literature on action recognition. The basic idea behind ST-GCN is to construct a spatial-temporal graph based on physical connections of human joints in consecutive frames. On this graph, the baseline model uses nine ST-GCN blocks to aggregate information on different scales.
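The graph construction described above can be sketched as follows: spatial edges come from physically connected joints within a frame, and temporal edges link the same joint across consecutive frames. The 5-joint layout and the function names are hypothetical and chosen only for illustration; real skeletons use 17 or 25 joints.

```python
from typing import List, Tuple


def spatial_adjacency(num_joints: int, bones: List[Tuple[int, int]]) -> List[List[float]]:
    """Symmetric N x N adjacency matrix built from physically connected joint pairs."""
    adj = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in bones:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    return adj


def temporal_edges(num_joints: int, num_frames: int) -> List[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """Edges linking joint i in frame t to the same joint in frame t + 1."""
    return [((t, i), (t + 1, i)) for t in range(num_frames - 1) for i in range(num_joints)]


# Toy skeleton (hypothetical layout): 0 = torso, 1 = head, 2 and 3 = hands, 4 = feet.
BONES = [(0, 1), (0, 2), (0, 3), (0, 4)]
A = spatial_adjacency(5, BONES)
E_F = temporal_edges(num_joints=5, num_frames=3)
print(A[0])
print(len(E_F))
```

Because the spatial connections are identical in every frame, a single adjacency matrix suffices for the whole sequence, while the temporal edges are implied by the frame ordering.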

Interaction Adjacency Matrix (IAM)
Here, we first review the basic concept of ST-GCN construction and then introduce an interaction adjacency matrix (IAM) as the new building block for modeling the imitation of HHI and HOI. Following the notation used in [71], we can construct a spatial-temporal graph $G = (V, E)$ from a skeleton sequence with $N$ joints and $T$ frames (the output of the 3D pose estimation from the input video sequence) as follows. In this graph, the set of vertices $V = \{v_{ti} \mid t = 1, \ldots, T;\ i = 1, \ldots, N\}$ includes all body joints (indexed by $i$) collected over the skeleton sequence (indexed by $t$). The graph includes two types of edges: 1) the spatial edge connects two joints within the same frame based on the human body structure; 2) the temporal edge connects the same joint in consecutive frames. Formally, the spatial edge connection can be denoted as $E_S = \{v_{ti}v_{tj} \mid (i, j) \in H\}$, where $H$ is the set of physically connected joints; and the temporal edge connection can be denoted as $E_F = \{v_{ti}v_{(t+1)i}\}$. Note that the spatial edge connections are kept the same for all frames. In other words, we only need to construct one spatial graph for the entire sequence. Similarly to the original formulation of GCN [26], the spatial graph can be directly characterized by an intraperson adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$, where $\mathbf{A}_{ij}$ ($i \neq j$) denotes the intensity of joint $i$ to joint $j$, or the intensity of the edge $v_{ti}v_{tj}$. Like an attention mask for image recognition, the adjacency matrix serves as an adjustable kernel to embed action-related information during training. Recent GCN-based approaches [11,55,56] have explored the modality of a trainable intraperson adjacency matrix for action recognition. We leverage this idea to flexibly extract salient features for symmetry detection (e.g., whether both hands of a person are moving during the interview) as follows.
An ST-GCN block consists of a GCN module and a standard convolutional module, where the GCN module plays an important role in extracting spatial-temporal information. Let $\mathbf{f}_{in} \in \mathbb{R}^{N \times C}$ and $\mathbf{f}_{out} \in \mathbb{R}^{N \times C'}$ be the input and output features of the current layer, respectively, where $N$ is the number of joints in a skeleton graph, and $C$ and $C'$ are the dimensions of the input and output features, respectively. In [71], the skeleton graph is represented by an adjacency matrix $\mathbf{A} \in \mathbb{R}^{3 \times N \times N}$ with spatial configuration partitioning. The adjacency matrix can be further disassembled into three matrices $\mathbf{A}_k \in \mathbb{R}^{N \times N}$, where $k \in$ {root node, centripetal group, centrifugal group} [71]. Then, the GCN module can be expressed as

$$\mathbf{f}_{out} = \sum_{k} \Lambda_k^{-\frac{1}{2}} \mathbf{A}_k \Lambda_k^{-\frac{1}{2}} \mathbf{f}_{in} \mathbf{W}_k, \qquad (1)$$

where $\Lambda_k^{-\frac{1}{2}} \mathbf{A}_k \Lambda_k^{-\frac{1}{2}}$ is the normalized adjacency matrix; and $\Lambda_k^{ii} = \sum_j \mathbf{A}_k^{ij} + \alpha$, where $\alpha$ is often set to a small positive real (e.g., 0.001) to avoid empty rows in $\mathbf{A}_k$. The learning of the trainable weight $\mathbf{W}_k \in \mathbb{R}^{C \times C'}$ can be implemented by adding a learnable mask $\mathbf{M}$ to each ST-GCN layer. The mask plays the role of scaling the contribution of a joint's features to its neighboring joints. For each adjacency matrix, we accompany it with a learnable weight matrix $\mathbf{M}$ and substitute $\mathbf{A} + \mathbf{I}$ in $\mathbf{A}_k$ by $(\mathbf{A} + \mathbf{I}) \otimes \mathbf{M}$, where $\otimes$ denotes the element-wise product of matrices.
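To make the normalized graph convolution concrete, here is a small sketch in plain Python that sums the contribution of each partition after symmetric normalization. It is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, the mask is applied after normalization (the paper does not fix the exact placement), and the toy partitions contain only a root (identity) and a neighbor matrix.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize(adj: Matrix, alpha: float = 0.001) -> Matrix:
    """Lambda^{-1/2} A Lambda^{-1/2}, with Lambda_ii = sum_j A_ij + alpha."""
    inv_sqrt = [1.0 / math.sqrt(sum(row) + alpha) for row in adj]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def gcn_module(partitions: List[Matrix], masks: List[Matrix],
               f_in: Matrix, weights: List[Matrix]) -> Matrix:
    """Sum over partitions k of (normalize(A_k) * M_k) @ f_in @ W_k."""
    out = None
    for adj, mask, weight in zip(partitions, masks, weights):
        norm = normalize(adj)
        masked = [[a * m for a, m in zip(ra, rm)] for ra, rm in zip(norm, mask)]
        term = matmul(matmul(masked, f_in), weight)
        out = term if out is None else [
            [o + t for o, t in zip(ro, rt)] for ro, rt in zip(out, term)
        ]
    return out


# Toy example: 3 joints in a chain, 2 input channels, 1 output channel.
root = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
neighbors = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
ones = [[1.0] * 3 for _ in range(3)]
f_in = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [[[1.0], [0.0]], [[0.0], [1.0]]]

f_out = gcn_module([root, neighbors], [ones, ones], f_in, weights)
print(len(f_out), len(f_out[0]))
```

The `alpha` term matters for partitions in which some joint has no neighbor: without it, the row sum would be zero and the inverse square root undefined.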
Recently developed GCN-based methods [11,55,56] explored the modality of the adaptive adjacency matrix to improve the performance on action recognition. However, most of them focus only on the intraperson relationship, ignoring the mutual interaction among people. This deficiency becomes more serious for the recognition of mutual actions between two people, such as those in the UT interaction dataset [51]. To solve this problem, ST-GCN-PAM [72] introduced the inter-person correlation of the same joint between two people with a pairwise adjacency matrix (PAM). Despite its conceptual simplicity, correlation-based modeling of HHI overlooks the rich and complicated dependency among humans, especially in the presence of asymmetric mutual actions (e.g., an autistic child shows no response to name-calling).
Inspired by the need for symmetry detection for HHI, we propose an extension of the existing adjacency matrix, named the Interaction Adjacency Matrix (IAM), to effectively model the inter-person interaction and summarize both intra-joint and inter-joint relationships. The IAM is formally defined based on the original adjacency matrix of [71] as follows:

$$\mathbf{A}_{IAM} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix},$$

where $\mathbf{A}_{pq} \in \mathbb{R}^{N \times N}$, and $p, q \in \{1, 2\}$ denote two people in HHI or a person and their mirror-symmetric counterpart in HOI imitation. When $p = q$, $\mathbf{A}_{pq}$ indicates the intra-person adjacency matrix (degenerating to the basic case $\mathbf{A}$) for HHI and HOI imitation. When $p \neq q$, we have an inter-person adjacency matrix characterizing the HHI between $p$ and $q$ or the left-right adjacency matrix for HOI imitation. Without prior knowledge of the presence of any interactions, we expect the model to automatically add or remove edges during the construction of an effective graph for $\mathbf{A}_{12}$ and $\mathbf{A}_{21}$. Instead of using a learnable mask $\mathbf{M}$ as in ST-GCN [71], we introduce a difference matrix $\mathbf{B}$ and substitute $(\mathbf{A} + \mathbf{I}) \otimes \mathbf{M}$ with $\mathbf{A} + \mathbf{B} + \mathbf{I}$. Eq. (1) can then be rewritten with $\mathbf{A} + \mathbf{B} + \mathbf{I}$ as the adjacency term, where $\mathbf{A}$ is a fixed adjacency matrix that can be interpreted as the average. Using such an average-difference decomposition, we can easily add new edges or remove old edges by modifying the matrix $\mathbf{B}$ in the learning process. Furthermore, our IAM is constructed as a directed graph (a similar idea exists in directed graph neural networks [55]), where we can learn the relationship of directional interaction by comparing $\mathbf{A}_{12}$ and $\mathbf{A}_{21}$ for the detection of symmetry.
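The block structure of the IAM and the role of the difference matrix can be illustrated with a toy example. This is a sketch under stated assumptions: the two-joint skeleton, the values placed in `B`, and the `asymmetry_score` measure (mean absolute difference between the two off-diagonal blocks) are all hypothetical; the paper only states that the two blocks are compared.

```python
from typing import List

Matrix = List[List[float]]


def build_iam(a11: Matrix, a12: Matrix, a21: Matrix, a22: Matrix) -> Matrix:
    """Stack four N x N blocks into one 2N x 2N interaction adjacency matrix."""
    top = [r1 + r2 for r1, r2 in zip(a11, a12)]
    bottom = [r1 + r2 for r1, r2 in zip(a21, a22)]
    return top + bottom


def effective_adjacency(fixed: Matrix, diff: Matrix) -> Matrix:
    """Return A + B + I; entries of B can add (positive) or remove (negative) edges."""
    n = len(fixed)
    return [[fixed[i][j] + diff[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def block(matrix: Matrix, p: int, q: int, n: int) -> Matrix:
    """Extract block A_pq (p, q in {1, 2}) from a 2N x 2N matrix."""
    return [row[(q - 1) * n:q * n] for row in matrix[(p - 1) * n:p * n]]


def asymmetry_score(a12: Matrix, a21: Matrix) -> float:
    """Hypothetical measure: mean absolute difference between A_12 and A_21."""
    diffs = [abs(x - y) for r1, r2 in zip(a12, a21) for x, y in zip(r1, r2)]
    return sum(diffs) / len(diffs)


N = 2  # toy skeleton: joint 0 = hand, joint 1 = face
intra = [[0.0, 1.0], [1.0, 0.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
fixed = build_iam(intra, zeros, zeros, intra)  # no inter-person edges a priori

# Made-up learned difference: person 1's hand attends strongly to person 2's face.
B = [[0.0] * 4 for _ in range(4)]
B[0][3] = 0.8

iam = effective_adjacency(fixed, B)
score = asymmetry_score(block(iam, 1, 2, N), block(iam, 2, 1, N))
print(score)
```

In this toy case only one direction carries an inter-person edge, so the score is nonzero; a symmetric interaction such as a handshake would be expected to produce similar values in both blocks and hence a score near zero.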
$\mathbf{A}_{12}$ and $\mathbf{A}_{21}$ can be extended to describe the relationship not only between humans and humans but also between humans and objects. In this paper, we leave the relationship between humans and objects for future work and focus on the relationship between humans and humans. We apply the IAM mechanism to our model and improve the performance of mutual action recognition on the mutual action subset of NTU RGB+D 120 and of HOI imitation tasks on the Caltech ADOS Interview dataset.

Contrastive Learning for ST-GCN based Binary Classification
As mentioned before, video-based action recognition faces several technical challenges, from the cluttered background to the lack of anomalous events during training. Multitask self-supervised learning (MS$^2$L) [17,22,33] has emerged as a promising solution to overcome these challenges. In [33], three tasks are integrated to learn skeleton features for action recognition. In [17], a similar construction of three self-supervised tasks (time arrow, motion irregularity, and object prediction) is combined with the distillation of knowledge for the detection of anomalies. The common principle is to first construct proxy tasks (e.g., puzzle recognition, object prediction) and then transfer the knowledge learned from self-supervised learning to downstream vision tasks (e.g., action recognition and anomaly detection).
Inspired by the success of MS$^2$L [17,33], we propose to design a hybrid approach for ST-GCN based binary classification, as shown in Figure 2. Similarly to [33], our aim is to model skeleton dynamics instead of video frames through motion prediction and to learn behavioral patterns by solving jigsaw puzzles. But unlike [33], we redesign the tasks of motion prediction and jigsaw puzzle by borrowing the idea of irregularity detection from [17]. In the context of the diagnosis of ASD, the irregularity of behavior is specifically related to broken symmetry (in HHI and HOI imitation) or persistent sameness (in HOI). Such domain knowledge can be translated into specially designed jigsaw puzzles. For example, the sameness in HOI can be exploited to predict the time arrow at the object level [67]. Social disconnection, as reflected by the broken symmetry in HHI, can lead to the prediction of directional action [32]. Broken symmetry in HOI imitation can be interpreted as a kind of motion irregularity: the failure to predict the motion of the left hand based on that of the right hand. We apply ST-GCN [71] as a backbone to the MS$^2$L [33] method and construct our self-supervised learning model. Similar to MS$^2$L, we pre-train our model with three proxy tasks: (A) a motion prediction task, (B) a jigsaw puzzle task, and (C) a contrastive learning task.
In the motion prediction task, a sequence of consecutive frames is masked from the input skeleton sequence. We use an encoder-decoder to guide the model to predict the most likely poses of the masked frames. The input skeleton sequence is $\mathbf{X} = \{x_1, \ldots, x_T\}$, and the masked sequence is $\mathbf{X}_m = \{x_1, \ldots, x_{T'}\}$, $T' < T$. The predicted motion sequence is $\hat{\mathbf{X}} = \{\hat{x}_{T'+1}, \ldots, \hat{x}_T\}$, and $B$ is the batch size. The mean square error (MSE) is used to estimate the motion prediction loss $\mathcal{L}_{mp}$ as follows:

$$\mathcal{L}_{mp} = \frac{1}{B} \sum_{b=1}^{B} \sum_{t=T'+1}^{T} \left\| \hat{x}_t^{(b)} - x_t^{(b)} \right\|_2^2 .$$

In the jigsaw puzzle task, the skeleton sequences are divided into three segments that are randomly shuffled. The model is trained to predict the correct sequence order. Using the shared encoder, we employ an MLP as a classifier. The predicted jigsaw category is $\hat{y}$, and $y$ is a one-hot vector that denotes the jigsaw label. We use a cross-entropy loss $\mathcal{L}_{jig}$ for sequence order classification as follows:

$$\mathcal{L}_{jig} = -\frac{1}{B} \sum_{b=1}^{B} y^{(b)\top} \log \hat{y}^{(b)} .$$

Finally, the masked input and the jigsaw input serve as transformation operations for contrastive learning. Let $z_1, z_2, \ldots, z_{KB}$ be the output of the encoder, where $K$ is the number of sequences per sample; for any integer $n$ from 1 to $B$, $z_{(n-1)K+1}$ is the original data, and the remaining sequences in the same group are its transformed versions.

Our model is pre-trained with the above proxy tasks and then fine-tuned for the downstream binary classification. We trained our model for 80 epochs, using a batch size of 64 on two Nvidia RTX 3090 Ti GPUs.
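The motion prediction and jigsaw proxy tasks described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: the function names are hypothetical, frames are reduced to one-dimensional toy poses, the MSE is averaged per element, and the "prediction" and logits are made-up values standing in for network outputs.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Frame = List[float]

# Three segments can be ordered in 3! = 6 ways; the index is the jigsaw label.
PERMUTATIONS = list(itertools.permutations(range(3)))


def split_masked(sequence: Sequence[Frame], keep: int) -> Tuple[List[Frame], List[Frame]]:
    """Return the visible prefix and the masked tail that must be predicted."""
    return list(sequence[:keep]), list(sequence[keep:])


def mse_loss(predicted: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean squared error over all coordinates of the masked frames."""
    squared = [(p - t) ** 2 for pf, tf in zip(predicted, target) for p, t in zip(pf, tf)]
    return sum(squared) / len(squared)


def make_jigsaw(sequence: Sequence[Frame], label: int) -> List[Frame]:
    """Split into three segments and reorder them by PERMUTATIONS[label]."""
    seg = len(sequence) // 3
    bounds = [0, seg, 2 * seg, len(sequence)]  # last segment absorbs any remainder
    segments = [list(sequence[bounds[k]:bounds[k + 1]]) for k in range(3)]
    return [frame for k in PERMUTATIONS[label] for frame in segments[k]]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of softmax(logits) against a one-hot label, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


frames = [[float(t)] for t in range(6)]            # toy 1-D "poses" x_1..x_6
visible, target = split_masked(frames, keep=4)     # predict the last two frames
loss_mp = mse_loss([[4.5], [5.5]], target)         # made-up predictions
shuffled = make_jigsaw(frames, label=3)            # segment order (1, 2, 0)
loss_jig = cross_entropy([0.0] * 6, label=3)       # uninformed classifier
print(loss_mp, [f[0] for f in shuffled], loss_jig)
```

With uniform logits the jigsaw loss equals log 6, the value expected from a classifier that has not yet learned anything about segment order, which gives a convenient sanity check at the start of pre-training.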
The binary classification includes: (A) asymmetric and symmetric mutual action recognition (AandS), (B) upper- and lower-body action recognition (UandL), and (C) periodic and aperiodic action recognition (periodicity). We define the class of symmetric interactions first. In our daily lives, when two people do the same action, like hugging, shaking hands, and high-five, the social protocol dictates the symmetry of both parties in the action space. We call these kinds of actions symmetric interaction (sym). Then we define dominance. When an initiating person's action leads to the following action of another person, like pushing, kicking, and pointing fingers, the two parties take an asymmetric (active vs. passive) role in the action space, i.e., dominance (the person initiating the interaction) and subordination (the person receiving the interaction). We call these kinds of actions asymmetric interaction (asym). In the UandL task, the model is trained to classify the data into upper- or lower-body actions. The upper-body action refers to movements or actions that are primarily triggered by the upper part of the body, including the arms, head, and torso, such as brushing teeth, putting on a jacket, clapping, etc. In contrast, lower-body actions indicate those actions triggered by the hips, legs, and feet, such as standing up, kicking something, hopping, etc. In the periodicity task, the model is trained to classify the data into periodic or aperiodic actions. The class of periodic actions refers to those actions that repeat themselves regularly over a certain period, such as brushing teeth, clapping, hand waving, etc. The class of aperiodic actions denotes those actions that do not repeat themselves in a regular or consistent pattern, such as standing up, kicking something, hopping, etc.

DECISION TREE FOR ACTION RECOGNITION AND ASD DIAGNOSIS

Action Recognition
Action recognition is one of the most representative tasks for video understanding [79]. Dozens of datasets such as NTU-120 [35], YouTube-8M [2], and Kinetics-700 [8] have been constructed to support the study of video-based action recognition. Despite the tremendous progress made in the past decade, an important limitation is the lack of generalization. Most existing action recognition methods are based on supervised learning. They cannot handle sophisticated interactions at social events. There is a need to extend the existing framework of action recognition to activity recognition by following an unsupervised and continual learning approach.
Figure 2 shows the first step in this direction. Our intuition is to decompose action/activity recognition into a series of binary classification problems (the building block constructed in the previous subsection) along the pre-trained decision tree. At the root level (marked in blue), the decision is about whether the video contains mutual action. For single-person action (left branch), we can classify the action into temporally reversible (e.g., random dancing or jumping rope) or irreversible (e.g., taking off vs. putting on the clothes). For mutual action (right branch), the interaction can be symmetric (e.g., hand-shaking or hugging) or dominant (e.g., one person pushing another). Both reversible actions and symmetric interactions can be further divided into periodic and aperiodic. At a more basic level, all actions or interactions can be classified into upper-body or lower-body movements.
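The branching logic just described can be written down as a short routing function. This is a sketch of the tree as described in the text, assuming each node's binary decision is already available; the dictionary keys and the function name `route` are hypothetical.

```python
from typing import Dict, List


def route(decisions: Dict[str, bool]) -> List[str]:
    """Follow the action-recognition tree from the root and return the visited labels."""
    path: List[str] = []
    if decisions["mutual"]:
        path.append("mutual action")
        if decisions["symmetric"]:
            path.append("symmetric")
            path.append("periodic" if decisions["periodic"] else "aperiodic")
        else:
            path.append("dominant")
    else:
        path.append("single-person action")
        if decisions["reversible"]:
            path.append("reversible")
            path.append("periodic" if decisions["periodic"] else "aperiodic")
        else:
            path.append("irreversible")
    # Every action or interaction is finally split into upper- or lower-body movement.
    path.append("upper-body" if decisions["upper_body"] else "lower-body")
    return path


# Made-up node outputs for a handshake clip.
handshake = {"mutual": True, "symmetric": True, "reversible": False,
             "periodic": False, "upper_body": True}
print(route(handshake))
```

Only the decisions on the visited branch influence the result, so a clip routed to the mutual-action branch never consults the reversibility node.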
We note that the construction of the above decision tree is not unique or optimal. In theory, the problem of video action/activity recognition can be solved by unsupervised online learning of decision trees in a hierarchical manner [19]. Unlike unsupervised clustering of face images [34], an optimal representation of human actions in the latent space has remained elusive. Some recent work [53] has adopted improved dense trajectory (IDT) [65] and pre-trained 3000-dimensional feature vectors for unsupervised action segmentation. Our ST-GCN based representation, when combined with contrastive learning, has shown impressive performance on action recognition, as we will demonstrate in the next section.

ASD Diagnosis
The practical problem of ASD diagnosis can boil down to a series of binary decisions along the hierarchy, as shown in Figure 3. At the root level lies the binary detection of human-object interaction (HOI). Based on the presence/absence of HOI, HHI and object detection are the next branching points on the tree. Then the class of HHI can be further divided into joint attention or HOI imitation, and the HOI can be further divided into repetitive behavior or normal play. To our knowledge, this is the first work to define the properties of symmetry and sameness to model the repetitive behavior of children with ASD.
Detecting symmetry in social interactions is motivated by several problems in computational social science [29]. Detecting dominance in social interaction [48] or the lack of motivation to participate in social interaction [49,50] often has direct implications for the modeling of human interactions [40]. In particular, autism research could benefit from detecting asymmetry in the behavior of autistic children from video (e.g., no response to name-calling [36]). Another hallmark behavior of autistic children is stimming, a type of self-stimulatory behavior characterized by the repetition of physical movements such as head banging, hand flapping, and tiptoeing.
The other important hallmark for autism diagnosis is sticking to sameness [18], i.e., autistic patients tend to insist on following the same routine, playing with the same toy, etc. Under the framework of ST-GCN contrastive learning, we can formulate stimming or sameness detection as a binary classification problem with different nodes on the decision tree. Unlike the original ST-GCN design [71], we borrowed ideas from directed graph neural networks [55] and Direcformer [64] to replace the learnable weight matrix with a trainable difference matrix, allowing the addition of new edges. Such a warm-up strategy has shown the benefit of regularizing the model based on prior knowledge of the human body.

Datasets
UT Interaction Dataset [51]. This was one of the first datasets collected for interaction recognition. It contains six types of two-person interaction and is composed of 10 types of nonperiodic atomic-level actions. The six classes of interactions include: shake hands, point, hug, push, kick, and punch. The 10 types of atomic actions are arm stretching, arm withdraw, leg stretching, leg lowering, moving forward, moving in the left and right directions.

NTU RGB+D [54]. This was the first large-scale dataset for 3D human activity analysis. It was collected at Nanyang Technological University in 2016 from 40 distinct subjects, containing more than 56 thousand video samples and 4 million frames. This dataset contains 60 action classes, including daily, mutual, and health-related actions. The authors of this dataset provided two evaluations: (A) cross-subject (CS), in which clips from a selected subset of actors serve as the training set and the remaining clips serve as the testing set; (B) cross-view (CV), in which clips from two cameras with different viewpoints are used as the training set, and the remaining clips are used as the testing set.

NTU RGB+D 120 [35]. The original NTU RGB+D dataset was expanded to 120 action classes, including 26 mutual actions (two-person interactions). The video samples have been captured by three Microsoft Kinect V2 cameras simultaneously. The resolution of the RGB videos is 1920 × 1080, depth maps and IR videos are all 512 × 424, and the 3D skeletal data contains the 3D locations of 25 major body joints in each frame. The authors also provided two evaluations similar to the original dataset: (A) cross-subject, which is the same as before; (B) cross-setup (CSet), in which clips are assigned to the training and testing sets according to the predefined setup.

Caltech ADOS Interview. This dataset was collected at CalTech from 2015 to 2017. It followed the ADOS-2 Module 4 protocol, consisting of 15 interview scenes. A total of 42 videos with a total duration of 3165 minutes were acquired; the average length of the ADOS interview is about 75 minutes (with a range of 50 to 150 minutes). All videos are scored by ADOS experts based on the following five categories with 32 elements: (A) Language and Communication, (B) Reciprocal Social Interaction, (C) Imagination/Creativity, (D) Stereotyped Behaviors and Restricted Interests, and (E) Other Abnormal Behaviors. Each score, ranging from 0 to 3, indicates the severity level of the ASD behavior targeted in that question. A score of 0 means that the participant's response was at the level one would expect for a person without ASD, while a score of 3 would be highly indicative of ASD.

Interaction Adjacency Matrix (IAM) and Symmetry-Related Features
First, we demonstrate the benefit of learning symmetry-related features for mutual action recognition. To facilitate the study, we manually separate the mutual action subset of NTU RGB+D 120 into asym and sym classes (assuming an oracle exists) and conduct mutual action recognition experiments on them separately. ST-GCN is the backbone of this experiment, and the baseline is the standard mutual action recognition. The accuracy results are shown in Table 1, where "Asym + Sym" denotes the weighted sum: the number of correctly predicted samples in asym and sym action recognition combined, divided by the total number of testing samples.

Table 1: Comparison of the accuracy between the weighted sum of the performance of asymmetric, symmetric, and standard mutual action recognition on the mutual-actions subset of NTU RGB+D 120. "Asym" and "Sym" denote the results on the subsets of asymmetric and symmetric interactions, respectively.

Figure 4: Visualization of the IAM learned from asym and sym mutual action recognition on NTU RGB+D 120. "Original" indicates the input IAM according to the physical connection. "asym" and "sym" denote the IAM learned from asymmetric and symmetric interaction, respectively. The columns correspond to the ST-GCN blocks.

As we can observe from Table 1, in both the CSet and CS settings, the recognition accuracy with the IAM module activated exceeds that without it in the Asym and Sym cases, as well as in the baseline and weighted-sum scenarios. These results support the benefit of the newly designed IAM for mutual action recognition. Additionally, we note that the accuracy gap between Asym and Sym is as large as 15%, which implies that Asym is the more challenging case for mutual action recognition. Our IAM module improves the Asym accuracy by 1.8% and 1.4% in the cross-subject and cross-setup settings, respectively. This means that our IAM module can effectively capture the asymmetric interaction between two people through the off-diagonal submatrices $A_{12}$ and $A_{21}$.

Next, we show the feasibility of symmetry detection. As mentioned above, a symmetric interaction is expected to have more similar characteristics between $A_{12}$ and $A_{21}$, which means that the edges from person 1 ($P_1$) to person 2 ($P_2$) are similar to the edges from $P_2$ to $P_1$. In this experiment, we initialize $A_{12}$ and $A_{21}$ with zeros and train our model to learn the IAM from the asym subset and the sym subset of the mutual actions of NTU RGB+D 120, respectively. Visualization results of the IAM learned from asym recognition and sym recognition are shown in Figure 4. As the visualization shows, the distribution of the edge intensities in "asym" is disorganized, while that in "sym" is better aligned along the diagonal of the matrix ("asym" appears more random than "sym", as shown by the lighter background color and more noise-like patterns). Furthermore, we have used cosine similarity and Euclidean distance to quantify the similarity between $A_{12}$ and $A_{21}$. Figure 5 shows the similarity and distance results calculated from the 9 ST-GCN blocks. The comparison of cosine similarities shows a striking difference between the asym and sym classes (except for the 6th and 8th layers). In the 9th layer in particular, the cosine similarities of asym and sym are highly distinguishable (-0.2 vs. 0.5). Similarly, the Euclidean distances of sym are uniformly smaller than those of asym, suggesting that the IAM can better preserve the similarity property of sym during training. These experimental findings support the feasibility of distinguishing sym from asym and show that the IAM can separate them in the latent space of ST-GCN.

Finally, we show the effectiveness of the IAM and symmetry-related features. Note that the objective of this work is not to advance the state-of-the-art (SOTA) in action recognition but rather to advocate the importance of understanding symmetry and dominance in mutual action recognition. Table 2 shows the results of our model (with and without the IAM module) and the SOTA approaches on the mutual actions of NTU RGB+D 120. Our proposed IAM module can still outperform most of the competing methods. Note that our method has achieved performance highly competitive with GeomNet [39] even without prior knowledge or Riemannian embedding.
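The comparison between the two off-diagonal blocks can be illustrated with a small sketch. It assumes the IAM is stored as a 2V x 2V nested list for V joints per person and that the two blocks are compared entry by entry without transposition; the function names and the toy matrices are hypothetical and are not taken from the experiments above.

```python
import math

def split_iam(iam, num_joints):
    """Return the off-diagonal blocks (A12, A21) of a 2V x 2V matrix."""
    v = num_joints
    a12 = [row[v:2 * v] for row in iam[:v]]       # edges from person 1 to person 2
    a21 = [row[:v] for row in iam[v:2 * v]]       # edges from person 2 to person 1
    return a12, a21

def _flatten(matrix):
    return [x for row in matrix for x in row]

def cosine_similarity(a, b):
    """Cosine similarity between two equally sized matrices, flattened."""
    u, w = _flatten(a), _flatten(b)
    dot = sum(x * y for x, y in zip(u, w))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_w = math.sqrt(sum(y * y for y in w))
    if norm_u == 0.0 or norm_w == 0.0:
        return 0.0  # convention chosen here for all-zero blocks
    return dot / (norm_u * norm_w)

def euclidean_distance(a, b):
    """Euclidean (Frobenius) distance between two equally sized matrices."""
    u, w = _flatten(a), _flatten(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, w)))

# Toy IAMs with V = 2 joints per person (illustrative values only).
iam_sym = [
    [1.0, 0.0, 0.5, 0.1],
    [0.0, 1.0, 0.1, 0.5],
    [0.5, 0.1, 1.0, 0.0],
    [0.1, 0.5, 0.0, 1.0],
]
iam_asym = [
    [1.0, 0.0, 0.9, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.0, 0.2, 1.0, 0.0],
    [0.7, 0.0, 0.0, 1.0],
]

cos_sym = cosine_similarity(*split_iam(iam_sym, 2))
cos_asym = cosine_similarity(*split_iam(iam_asym, 2))
dist_sym = euclidean_distance(*split_iam(iam_sym, 2))
dist_asym = euclidean_distance(*split_iam(iam_asym, 2))
```

In this toy case the symmetric matrix yields a higher cosine similarity and a smaller distance than the asymmetric one, which is the qualitative pattern reported for Figure 5.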

Downstream Binary Classification Tasks
In this section, we first demonstrate the benefit of learning binary classification tasks, including asymmetric versus symmetric (AandS), upper-body versus lower-body (UandL), and periodic versus aperiodic (Periodicity). We then show the performance of our model on these binary classification tasks on the NTU RGB+D dataset. Similar to the demonstration of the benefit of learning symmetry-related features in Sec. 5.2, we manually separate the NTU RGB+D dataset into two classes (assuming that there is an oracle) depending on the binary task and perform action recognition experiments on them separately. ST-GCN is the backbone of this experiment, and the baseline is standard action recognition. The accuracy results are shown in Table 3. Comparing the performance of AandS in Table 1 with that of UandL and Periodicity in Table 3, we learn that Periodicity and AandS features are important for the action recognition problem, since each improves performance by at least 1%. There is less improvement over the baseline with the UandL task since the model already has a better ability to extract UandL features (see Table 4).

The results of the binary classification tasks are shown in Table 4, with "NTU60" and "NTU120" denoting the NTU RGB+D and NTU RGB+D 120 datasets, respectively. The evaluation strategy used is indicated by "CS" for cross-subject, "CSet" for cross-setup, and "CV" for cross-view. We applied two models in this section: "baseline" (ST-GCN) and "+MS2L" (with ST-GCN as the backbone). MS2L is a self-supervised learning method. It uses three proxy tasks (a motion prediction task, a jigsaw puzzle task, and a contrastive learning task) to pre-train the model and extract the information hidden within the skeleton sequence. We fine-tuned the pre-trained model on the binary classification tasks. As we can observe from Table 4, UandL achieves the best performance, as there are significant differences in the spatial and temporal dimensions between upper-body and lower-body movements. Furthermore, even though MS2L is a model fine-tuned on downstream tasks, it achieved performance similar to the baseline, within about a 1% accuracy gap (except for the performance on CS in the periodicity task). This suggests that the hidden information extracted by the proxy tasks of MS2L is important for binary classification.
We also tested dominance detection on the UT dataset. As shown in the tree in Figure 2, the detection of dominance is the follow-up task for asym. We have applied ST-GCN and our contrastive learning model to this task. Due to the relatively small size of the UT-interaction dataset, we combine UT-1 and UT-2 for dominance detection. There are four types of asym interactions (kicking, pointing, punching, and pushing) in the UT-interaction dataset. Table 4 shows the performance of our model for dominance detection.
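The weighted-sum accuracy used when a dataset is split into two subsets by an oracle can be sketched as follows. The function name and the sample counts are hypothetical and serve only to illustrate the arithmetic; they are not results from Tables 1 or 3.

```python
def weighted_sum_accuracy(correct_per_subset, total_per_subset):
    """Pooled accuracy over several subsets.

    Equivalent to averaging the per-subset accuracies with weights
    proportional to the subset sizes.
    """
    if len(correct_per_subset) != len(total_per_subset):
        raise ValueError("both lists must have one entry per subset")
    total = sum(total_per_subset)
    if total == 0:
        raise ValueError("no testing samples")
    return sum(correct_per_subset) / total

# Hypothetical counts: 80 of 100 asym samples and 285 of 300 sym samples correct.
correct = [80, 285]
totals = [100, 300]
pooled = weighted_sum_accuracy(correct, totals)

# The same number obtained as a size-weighted average of per-subset accuracies.
weighted = sum((c / t) * (t / sum(totals)) for c, t in zip(correct, totals))
```

Because the subsets usually differ in size, the pooled value is not the plain mean of the two subset accuracies; it leans toward the larger subset.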

ASD Diagnosis on Caltech ADOS Interview
To verify the effectiveness of our model, we applied it to the HOI imitation task, which is one of the 15 interview scenes in the Caltech ADOS Interview dataset [76]. Specifically, the participants are asked to demonstrate and describe how they brush their teeth. We pre-train the model on the NTU RGB+D 120 dataset and fine-tune it with the HOI imitation task data. Five-fold cross-validation is applied to estimate the performance of our model on ASD diagnosis (ASD vs. typical development binary classification) based on atypical motor behavior. The results are shown in Table 5.
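A five-fold cross-validation loop of the kind described above can be sketched in plain Python. The helper names, the random seed, and the callback `train_and_eval` are hypothetical; the sample count of 42 reuses the number of videos in the dataset purely as an example, since the number of usable clips for this scene is not stated.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validate(n_samples, train_and_eval, k=5, seed=0):
    """Average the score returned by train_and_eval over the k folds."""
    scores = [train_and_eval(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

# Example: 42 samples split into five folds.
splits = list(k_fold_indices(42, k=5))
fold_sizes = [len(test) for _, test in splits]

# A dummy evaluator that just reports the test-fold fraction.
mean_fraction = cross_validate(42, lambda train, test: len(test) / 42)
```

In practice `train_and_eval` would fine-tune the pre-trained model on the training indices and return the accuracy on the held-out fold. A subject-wise or stratified split may be preferable for clinical data, which this sketch does not attempt.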
To explain the performance of the action recognition model in ASD diagnosis, we visualize nine adaptive graphs (one from each ST-GCN block) learned from the HOI imitation task in Figure 6. The graphs show which edges are significant and contribute to the classification. The brighter the line, the larger its contribution to the classification. A joint with a higher degree of bright edges is more significant for the classification. In the HOI imitation task, right-handed participants use their right hand to pretend they are holding a toothbrush. These low-level features of the right-hand movement are extracted by the shallowest ST-GCN block (block 1). As the blocks go deeper and the receptive field becomes wider, the joint of the left hand becomes more significant (block 2 to block 6). Finally, both left and right hands are taken into account, and the left hand is the most significant feature in the deepest layer (block 9). These visualization results are strong evidence that the left-hand feature plays a significant role in the ASD diagnosis of a right-handed person. As shown in Figure 7, due to atypical motor behavior, participants with ASD usually avoid expressive gestures with their left hand, such as gripping their fists. In sharp contrast, participants with typical development usually use expressive gestures with their left hand to describe how they brush their teeth. We have verified that this is indeed the most salient feature used by clinicians when inspecting the HOI imitation task. By extracting symmetric features, the introduction of the IAM can avoid the bias between left-handed and right-handed individuals and outperform the baseline.
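One way to turn "a joint with a higher degree of bright edges" into a number is a weighted degree computed from the learned adjacency matrix. The sketch below assumes that edge brightness corresponds to the absolute edge weight and that incoming and outgoing edges count equally; both choices, the function names, and the toy matrix are assumptions made for illustration.

```python
def joint_significance(adjacency):
    """Weighted degree of each joint: sum of absolute in- and out-edge weights."""
    n = len(adjacency)
    return [
        sum(abs(adjacency[i][j]) + abs(adjacency[j][i]) for j in range(n) if j != i)
        for i in range(n)
    ]

def rank_joints(adjacency, joint_names):
    """Return (name, score) pairs sorted from most to least significant."""
    scores = joint_significance(adjacency)
    return sorted(zip(joint_names, scores), key=lambda pair: pair[1], reverse=True)

# Toy 3-joint adaptive graph (illustrative values only).
names = ["head", "left_hand", "right_hand"]
learned = [
    [0.0, 0.1, 0.2],
    [0.6, 0.0, 0.3],
    [0.1, 0.2, 0.0],
]
ranking = rank_joints(learned, names)
```

Applying such a ranking to the adaptive graph of each ST-GCN block would give a per-block ordering of joints comparable to the visual reading of Figure 6.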

CONCLUSION
In this paper, we propose a decision tree-based contrastive learning framework for behavioral imaging and demonstrate its applications in action recognition and diagnosis of ASD. Using a skeleton graph as the data representation, we have constructed an interaction adjacent matrix on the graph convolutional network as the backbone for modeling action-related attributes. Through the construction of various pretext tasks and loss functions, we have designed a series of binary classification nodes on the decision tree, which can be combined to support higher-level vision tasks such as action recognition and ASD diagnosis. For action recognition, we have focused on the experiments with periodicity and dominance attributes; for ASD diagnosis, we have studied symmetry and sameness attributes. Our preliminary experimental results have shown the promising performance of the decision tree-based contrastive learning approach on four binary attribute classifications (AandS, UandL, Periodicity, and Dominance) on the UT Interaction, NTU-60, and NTU-120 datasets. For ASD diagnosis, our approach has achieved an accuracy above 80% on the challenging CalTech ADOS Interview database. In addition to good accuracy, the transparency of our ST-GCN-based approach is another desirable property because interpretability matters in ASD-related clinical practice. However, the effectiveness of our decision tree-based contrastive learning method is contingent upon the quality and quantity of binary attributes. Future work includes the expansion of the constructed decision trees (i.e., extraction of more action-related and ASD-related binary attributes), optimization of contrastive learning loss functions and hyper-parameters, and attention-based fusion of multiple nodes to support multi-label action recognition and fine-granularity diagnosis of ASD.

Figure 2: Extending binary decision tree for multi-class action recognition (note that an $N$-class action recognition can be decomposed into a binary decision tree with a height of approximately $\log_2 N$).
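A binary tree that separates a given number of classes needs roughly the base-2 logarithm of that number of decision levels. The sketch below computes this count under the assumption of a balanced tree; the function name is illustrative.

```python
def balanced_tree_height(num_classes):
    """Smallest h such that 2**h >= num_classes, i.e. ceil(log2(num_classes))."""
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    # Integer arithmetic avoids floating-point rounding near powers of two.
    return (num_classes - 1).bit_length()

heights = {n: balanced_tree_height(n) for n in (60, 120, 10000)}
```

Under this assumption, 60 classes need 6 binary decisions, 120 classes need 7, and 10,000 classes need 14. A tree built from semantic attributes is not necessarily balanced, so actual depths may be larger.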
$z_{(i-1)K+2}$ to $z_{iK}$ are the transformed samples from the original sequence. $\bar{z}_i = \frac{1}{K}\sum_{j=(i-1)K+1}^{iK} z_j$ indicates the mean features of the original and transformed data for $z_{(i-1)K+1}$. $\mathrm{sim}(\cdot)$ is the cosine similarity. $(K-1)$ is the number of transformation operations. Positive pairs are constructed from the transformation operation, while negative pairs are constructed with other samples. With = , the contrastive loss $L$ is formulated as follows:
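A minimal sketch of a loss built from these ingredients is given here for illustration. It assumes an InfoNCE-style form in which each sample is pulled toward the mean feature of its own group (the original sequence plus its transformed copies) and pushed away from the mean features of other groups; the temperature `tau`, the use of group means as negatives, and all function names are assumptions and may differ from the formulation used in this work.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def contrastive_loss(groups, tau=0.1):
    """InfoNCE-style loss; groups[i] holds one original feature and its transforms."""
    means = [mean_vector(group) for group in groups]
    losses = []
    for i, group in enumerate(groups):
        for z in group:
            logits = [cosine(z, m) / tau for m in means]
            # Log-sum-exp with max subtraction for numerical stability.
            top = max(logits)
            log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
            losses.append(log_denominator - logits[i])   # -log softmax of the positive
    return sum(losses) / len(losses)

# Toy 2-D features: consistent groups versus groups whose members disagree.
consistent = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]
loss_consistent = contrastive_loss(consistent)
loss_mixed = contrastive_loss(mixed)
```

In the toy example the loss is close to zero when transformed copies stay near their original and equals log 2 when the two group means coincide, which is the behavior expected from a contrastive objective of this kind.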

Figure 3: Construction of a binary decision tree for video-based autism spectrum disorder (ASD) diagnosis. Note that we have only drawn a partial tree here for illustration purposes: we will focus on the leaf nodes of symmetry and sameness in this work (other non-leaf nodes, such as joint attention and stimming detection, are left for our future work).

Figure 5: Symmetry and asymmetry comparison. (a) Cosine similarity. Symmetry has a higher similarity between $A_{12}$ and $A_{21}$ than asymmetry. (b) Euclidean distance. Symmetry has a smaller distance between $A_{12}$ and $A_{21}$ than asymmetry.

Figure 6: Visualization of significant edges in the graph learned from HOI imitation on the Caltech ADOS Interview dataset. The brighter the white color of the line, the larger its contribution to the classification result. The orange edges indicate the initialization of the adjacency matrix according to the physical connection. The columns correspond to the ST-GCN blocks.

Figure 7: Example of gesture differences between ASD and TD in HOI imitation on the Caltech ADOS Interview dataset.

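Table 1: "Asym + Sym" indicates the weighted sum, calculated by dividing the number of correctly predicted samples in asym and sym action recognition by the total number of testing samples.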

Table 3: Comparison of the accuracy between the weighted sum of the performance of UandL and Periodicity binary classification tasks on NTU RGB+D. From Tables 1 and 3, we learn that Periodicity and AandS features are important for the action recognition problem, since each improves performance by at least 1%.

Table 4: Performance of each binary classification task.

Table 5: Performance of IAM on the HOI imitation scene of the Caltech ADOS video dataset.