Is Someone There or Is That the TV? Detecting Social Presence Using Sound
N. C. Georgiou et al.

Social robots in the home will need to solve audio identification problems to better interact with their users. This article focuses on the classification between (a) natural conversation that includes at least one co-located user and (b) media that is playing from electronic sources and does not require a social response, such as television shows. This classification can help social robots detect a user’s social presence using sound. Social robots that are able to solve this problem can apply this information to assist them in making decisions, such as determining when and how to appropriately engage human users. We compiled a dataset from a variety of acoustic environments that contained either natural or media audio, including audio that we recorded in our own homes. Using this dataset, we performed an experimental evaluation on a range of traditional machine learning classifiers and assessed the classifiers’ abilities to generalize to new recordings, acoustic conditions, and environments. We conclude that a C-Support Vector Classification (SVC) algorithm outperformed other classifiers. Finally, we present a classification pipeline that in-home robots can utilize, and we discuss the timing and size of the trained classifiers as well as privacy and ethics considerations.


INTRODUCTION
Imagine you are walking around the house when you stumble upon a door that is slightly ajar, opened just enough so you can hear, but not see, what is going on inside. Opening the door to see whether or not it is appropriate to enter is self-defeating. If you do not hear anything, then it is very difficult to make any judgments. Suppose, however, that you hear human speech from behind the door. This piece of information can give you insight and can help you in your decision-making.
However, knowing that there is human speech is not enough. Many lower-level characteristics, as well as higher-level conceptual components, of this speech might be important factors in your decision. Do you recognize the voices? Does the speech sound serious, or is it more lighthearted? Is there shouting, or is the tone normal? What emotions can you detect from the speech? How many people can you hear? If you hear two friendly-sounding people having a chat, then you might be more inclined to knock. If you stop by to relay a message and you hear yelling coming from the room, then it is probably best to steer clear for now. But imagine that the yelling is from an enthusiastic sportscaster describing a sporting event, or that the serious tone that you hear is from a dramatic soap opera. You might make a different decision if you know that the speech is coming from a television show rather than from physically present people conversing in the room. This is an important component of the speech that will influence your understanding of the situation and can affect how you interact, if you do.
Similarly, a social robot that is designed to interact with users in realistic and appropriate ways should have the ability to make this disambiguation. The robot can benefit from knowing whether the speech coming from behind the door is from a physically present human socializing. More generally, knowing whether speech is the product of at least one co-located person conversing can assist social robots in making inferences about users' activities and can help them accommodate their users through a better understanding of their environments. This article focuses on classifying whether there is (1) natural conversation occurring that includes at least one co-located user or (2) media playing from electronic sources that does not require a social response. These are common speech scenarios in the home, and distinguishing them can help the robot detect the social presence of a user through what it hears.
In practice, we imagine countless settings where the ability to make such a classification could be utilized by robots to assist them in accomplishing their goals. For example, a social companion robot in the home may decide to engage a co-located user with a supportive, social interaction if it infers that the user is upset, as opposed to if it knows the speech is media. A robot assisting people with Autism Spectrum Disorder may not interrupt when a user is engaged in natural conversation (to encourage social interaction) but may attempt to engage if it suspects the user is watching too much media. A customer service robot may decide whether or not to head in the direction of customers chatting in a store or may choose to disregard the speech if it is coming from a TV. An in-home robot may reach out for external assistance if a user is distressed but may not if it realizes the speech is from an action movie on TV. Depending on the end goals of the system, the robot can use such a classification, along with other prudent factors, to help it in making decisions.
To precisely characterize the differences between audio from natural and media scenarios is a challenge. Both of these audio categories contain human voices. Both categories contain diverse audio with similarities that make it difficult to quantify how we, as humans, usually know which of the two we are listening to. One potential discriminatory criterion, for example, is the scripted speech patterns in the conversation of television shows, as opposed to the more spontaneous nature of impromptu conversation. This could be sufficient for categorizing a sitcom as media, but it does not help us correctly classify a radio podcast where the host is casually interviewing a guest. One could also try to make this classification based on whether the audio is cleanly engineered, like that produced in a studio, versus the noisy, distorted natural audio environments of everyday life. This can help with correctly classifying a TV show or movie played on a good sound system as media but will not help when listening to sports, which involve crowd and audience noise. Solely detecting the presence of electronically sourced audio (i.e., coming from the speakers of a computer or television) is also not enough. Video calls with friends are natural situations in which there is electronic-sourced audio, along with at least one organically sourced (i.e., coming directly from human vocal cords) speaker playing an active role in the conversation. If we know that some part of the audio is organically sourced, then we can be sure that there is a co-located, physically present person talking. But it can sometimes be tough to know if this is the case, especially if electronic audio sounds natural (e.g., conversational) and is played on a high-quality sound system. Making the classification between audio that is natural or media is hard.
For this article, we focus on classifying between natural and media audio in the dynamic environment of the home. We focus on differentiating between speech from popular genres of media originating from loudspeakers and speech from natural conversations including at least one co-located person in the home. Ideally, robots in real-world environments would have the ability to make this classification regardless of the acoustic environments they are in (e.g., different rooms, different loudspeakers, distances from the audio source) and the different audio content that they hear (e.g., different voices, different TV/radio shows, background noise). Social roboticists who deploy robots in the home and intend to use audio to make decisions on how their robots interact with users can benefit from this work.
Our main contributions are:
- Describing a salient audio problem that social robots in the home face: the classification between (a) natural conversation including at least one co-located user and (b) media playing from electronic sources that does not require a social response
- Training classifiers that use in-home audio to differentiate between natural and media audio and evaluating how well the classifiers generalize to new recordings, acoustic conditions, and environments
- Proposing a classification pipeline that can provide additional, situational context to a social robot by assisting it in detecting social presence using sound

The organization of the article is as follows: Section 2 offers background and related work. Section 3 describes the methodology in collecting the dataset, in selecting and extracting features of the audio, and in selecting the classification algorithms. Section 4 describes the experiments used to test the generalizability of the classifiers and discusses the results. Section 5 discusses how these classifiers can be applied in practice, with details on the timing and size of each, a proposed classification pipeline, and a discussion of ethics and privacy considerations. Section 6 discusses some limitations of the work, and Section 7 concludes the work.

BACKGROUND AND RELATED WORK

Virtual Assistants in the Home
Virtual assistants in the home extract features from a user's speech and send the speech command to the cloud for natural language processing [26]. These features inform the assistant's decision-making policy to effectively and appropriately respond [29, 35]. These in-home systems do not incorporate much, if any, contextual awareness of their surroundings [40]. In fact, these systems typically require specific and explicit user prompts to engage them (e.g., "Alexa"). Because these systems are user-initiated, the detection of social context is much less necessary. Yet, for systems designed to interact with users autonomously, the ability to garner context about the environment is crucial [28].
We believe that virtual assistants can also benefit from the ideas presented in this article, especially if developers believe there is value in additional functionality that includes behaving more socially and independently.Although we will focus on social robots in this article, we note that social presence through sound can be of use to any device in the home that could utilize such context to help it make decisions.

Using Audio for Activity and Event Detection in the Home
Automatic recognition of user activity in dynamic, unstructured environments, like the home, is important for systems whose primary purpose is to support their users through social means. Having some understanding of a user's activity and social context can help the system in its decision-making.
Audio scene classification (ASC), or the identification of the environment or activity based on acoustic signals, is important for robotics and can help better facilitate human-robot interaction [3]. ASC has become a trending topic with growing interest because of the advent of smart homes and robots [14, 45, 47]. In recent years, audio analysis capabilities have been added to assistive robotic systems, such as the TIAGo service robot [19] and RiSH, a robot-integrated smart home for elderly care [13], with the goal that audio will provide more contextual awareness. Work on audio analysis in the home includes activity detection specific to helping the elderly by detecting falls [38] or by identifying common activities to help medical staff monitor people who utilize ambient assisted living services [2, 11, 36]. Audio scene classification has also been used in the context of differentiating between specific kitchen sounds such as the mixer, dishwasher, and utensils clanking [45]; bathroom sounds such as showering, washing hands, and flushing [9]; breathing or snoring [17]; or common sounds including keyboard typing, applause, and phone ringing [42]. Traditional machine learning classifiers have been used for these classifications with success.
Work has also been done that involves classifying in-home audio with the help of humans-in-the-loop. Some of this work includes human-assisted sound event recognition for home service robots for the elderly, where a human caregiver provides a robot with in-the-loop labels for non-voice sounds to help the robot actively learn auditory events [12]. Additional work has used audio to classify different rooms in the home, such as the kitchen and office, and also discriminated between nonverbal sounds such as clapping and one-word speech scenarios [30].
The research area of voice activity detection (VAD) looks to classify between audio that contains speech and non-speech [20]. Research has been done to use noise cancellation to better implement VAD on smart home devices [22]. Other VAD work includes enhanced speech detection for humanoid robots in sparse dialogue [24] and robust classification between speech and non-speech in noisy environments [39]. Work has been done to recognize emotional states from speech using a support vector machine [41], to separate speech from music [1], and to detect and classify noises in speech signals [33].
There has also been research into how to accurately discriminate speech commands produced by an electronic speaker from organic human speech [6]. This approach was discussed in the context of cybersecurity, to better identify replay attacks on Internet of Things devices by determining the origin of pre-written speech commands, but it does not involve in-home, noisy experimentation.
Our work presents a new tool that robots in the home can use to gather more context about a user's social presence through sound when presented with human speech. The classification between natural and media audio that we focus on in this work encapsulates common speech scenarios in the home that can give insight into people's activities. Our experimentation focuses on real-world audio recorded in noisy, in-home environments, and this work adds to the research area of activity detection in dynamic environments.

Audio Classification of Media
Work has also been done on the audio classification of different forms of media, particularly genre classification. Music information retrieval methods have explored classifying songs into genres such as pop, rock, or blues [5, 43], and television media classification has sorted videos into genres such as cartoons, news, or weather forecasts [15]. A key aspect of many of these media approaches, along with the in-home activity detection of Section 2.2, involves extracting time- and frequency-domain features (e.g., spectral contrasts, spectral roll-offs, Mel-frequency cepstral coefficients, or chroma features) from the overall audio signal and using these features to inform and train machine learning classification algorithms. We build on this work by using similar features in our analysis, and we discuss more background and motivation for the feature selection in Section 3.2.

METHODOLOGY
In this section, we describe how we (a) compiled an audio dataset containing the natural and media classes, (b) extracted features from each audio sample, and (c) selected the machine learning classifiers that we experimented with. We define two terms that we will be using throughout this article. First, when discussing a sample, we are referring to a 5-second segment of audio that has been recorded and is used in feature extraction. A recording is a collection of contiguously captured samples during a given time window.
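A minimal sketch of this sample/recording convention, assuming 16 kHz mono audio stored as a NumPy array (the function and constant names here are ours, for illustration only):

```python
import numpy as np

SR = 16_000      # sampling rate (Hz), matching the dataset's recordings
SAMPLE_SEC = 5   # duration of one sample, as defined above

def split_into_samples(recording, sr=SR, sample_sec=SAMPLE_SEC):
    """Chop a mono recording into contiguous 5-second samples.
    A trailing remainder shorter than one sample is dropped."""
    n = sr * sample_sec
    usable = (len(recording) // n) * n
    return recording[:usable].reshape(-1, n)

# A synthetic 17-second recording yields three 5-second samples.
recording = np.zeros(17 * SR)
samples = split_into_samples(recording)
print(samples.shape)  # (3, 80000)
```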

Audio Sample Collection
We collected audio content from various television genres and radio shows (sound from electronic speakers) and human speakers (sound from human voices).The final dataset contained approximately 30 hours of audio recordings and was well-balanced between the media and natural classes.
Both categories were recorded with Kinect One microphones. This was important because any decisions made by a machine learning classifier would then reflect differences in the audio content, rather than discrepancies caused by different recording hardware.

Media Recording Set.
Our media (M) recording set consisted of a variety of TV shows and radio recordings that we recorded on the Kinect One. We focused on collecting audio recordings from popular television genres, which include drama, comedy, participatory/reality, news, and sports [46], as well as audio from radio shows. This category was recorded in different rooms, using a variety of electronic speakers, with the microphone capturing audio at varying distances from the speakers during different contiguous time windows. Recording during different time windows allowed different background and ambient noise to be captured as part of the various recordings. All audio was recorded at a rate of 16 kilohertz (kHz) in the waveform audio file format (.wav).
Each room, speaker, and microphone position configuration is referred to as its own unique label. These different recording configurations emulate a variety of recording conditions that an in-home agent might face. The distribution of the audio in each label can be seen in Table 1. There are 60 media recordings in our dataset, with a total of 10,138 samples, for around 14 hours of audio. Depending on the experiment that we performed, a different split of the recordings in the media set was used as training and testing data (explained in more detail in Section 4).

Natural Recording Set.
The natural recording set can be broken down into three categories: CHiME-5 (C), Video Calls (V), and Family Conversations (F).
Natural Audio from CHiME-5. Category C recordings were composed of content from the CHiME-5 dataset [4], available online. CHiME-5 contains audio captured from dinner parties in different houses. Each dinner party involved a different group of four people, who were told to engage in natural conversation in the house's kitchen, dining room, and living room for at least 2 hours.
Category C contained audio from 10 different CHiME-5 sessions. Each session contained audio from six Kinect microphone arrays, placed in different locations (bedroom, kitchen, living room) in each home, with audio input from each channel of each microphone. We used audio from the different Kinect microphones within the same dinner party in our dataset because we wanted a diverse set of audio captured from different locations with varying acoustic properties. For the C category, we considered a recording to be all of the audio collected from a unique CHiME-5 session. The CHiME-5 audio files were in the waveform audio file format (.wav), with a recording rate of 16 kHz. We chose CHiME-5 because it captured natural, social scenarios that one can expect to find in a home environment. We input the CHiME-5 files directly into the classifier, because this is how natural audio would be captured by the robot. In total, category C contained 10,130 samples (1,013 samples per recording). This sample count is equivalent to approximately 1.4 hours per CHiME-5 session, for a total of almost 14 hours of audio. Samples from the C category were used as our natural training data.
Natural Audio from Our Home Environments. We also captured natural audio from our own homes. We had Institutional Review Board approval to record audio in homes and to extract and analyze acoustic features. There were two categories that we experimented with, involving natural scenarios from six rooms in three different homes. We left a recording microphone in locations that we deemed appropriate for an in-home robot or device to be placed, recorded audio, and later inspected the audio. Audio from these two categories was used as our natural testing data.
Category V captured audio from video calls taking place in a home's office, dining room, and living room. These recordings involved conversations between members of a family consisting of two children and three adults. Members of the family congregated in their dining room and spoke over a video call on a laptop and phone using Zoom or Facebook Messenger. The calls were all on speaker. As a result, voices were variably distant from the microphone, and the recordings captured by the Kinect included a mixture of voices coming from an organic source (the person in the same room as the Kinect microphone) and from electronic sources (the people on the video call). The same person was physically in the room with the Kinect for each of these recordings. Category V included six separate recordings, with a total of 917 samples.
Category F consisted of audio collected from family conversations in kitchens and living rooms in three different homes. The microphone was placed close to where people were dining and conversing. An example location for the microphone was on a counter in an open, spacious kitchen. The kitchen recordings included some background noises, such as the running sink.

There are multiple reasons that we decided to also collect natural audio that we recorded ourselves, despite having an extensive corpus of in-home, natural audio from CHiME-5. Even though we tried to collect our media sample set with similar recording characteristics (i.e., microphone and sampling frequency) to CHiME-5, we wanted to see whether classifiers trained solely on CHiME-5 could generalize to classifying other natural audio from outside of that corpus. This could show that these classifiers are able to correctly disambiguate between natural and media audio recorded by us, and that the classification is not just a result of discrepancies between how CHiME-5 was collected and how we recorded our audio. Last, we wanted to be able to experiment with the case of social presence that includes a mixture of electronic audio and organic-sourced natural audio, captured in the V dataset. This circumstance indicates social presence, because at least one user that is co-located with the robot is engaged in a natural conversation while chatting on a call with others. Samples from the V and F categories were used as our natural testing data.

Feature Extraction
We split our entire audio dataset into 5-second samples. From each sample, we extracted features to create an input vector that was used to train machine learning classifiers. We used the LibRosa Python package [31] to extract audio features. These are commonly used features in audio analysis (as mentioned in Section 2.3), which was the motivation for using them.
In total, 83 features were extracted from each audio sample. We performed a standard transformation of each feature to normalize the feature set. The input vector contained the features below for each audio sample:
- Mel-frequency cepstral coefficients (MFCCs): These are dominant features that have historically been used in speech recognition, and they have been explored in separating music and speech [27]. It is typical that 13 coefficients are used for speech representation [43], so we use the means and standard deviations of each of the first 13 coefficients over the sample, for a total of 26 features.
- Chroma Energy Normalized Statistics (CENS): These features have been used in audio analysis research to match similar audio [34]. There are 12 chroma classes, and we use the mean and standard deviation of each chroma class over the sample, for a total of 24 features.
- Root-mean-square (RMS) energy values: Energy features are commonly used in audio analysis, with some prior work finding that the combination of energy with MFCCs is better than using MFCCs alone [23]. We use the range, standard deviation, and skewness of this feature, for a total of 3 features.
- Zero-crossing rates: These features are commonly used in audio analysis [15] and can help provide a measure of the noisiness of the audio sample [43]. We use the mean, standard deviation, and skewness, for a total of 3 features.
- Tempo: This feature estimates the beats per minute in the audio sample. The motivation behind adding it is that music from TV or radio commercials typically has more tempo than conversational audio in the home. This is 1 feature.
- Spectral centroid, flatness, rolloff, and bandwidth: These are also commonly used low-level components of the audio signal [10, 43]. We use the mean, standard deviation, and skewness of each, for a total of 12 features.
- Spectral contrast: These features have been shown to discriminate among different music genres [23], so we use the means and standard deviations of seven sub-bands, for a total of 14 features.
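As an illustration, here is how two of the summarized feature groups above (RMS energy and zero-crossing rate) might be computed over a 5-second sample with plain NumPy. The actual pipeline uses LibRosa's implementations; this simplified sketch uses our own framing parameters and skips the per-feature normalization:

```python
import numpy as np

def skewness(x):
    """Sample skewness of a 1-D array."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return 0.0 if s == 0 else float(np.mean((x - m) ** 3) / s ** 3)

def frame(signal, frame_len=2048, hop=512):
    """Slice a mono signal into overlapping frames."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n)])

def rms_and_zcr_features(signal):
    """Summary statistics over frame-level RMS energy and zero-crossing
    rate: range/std/skewness for RMS, mean/std/skewness for ZCR."""
    frames = frame(signal)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return np.array([
        rms.max() - rms.min(), rms.std(), skewness(rms),  # RMS features
        zcr.mean(), zcr.std(), skewness(zcr),             # ZCR features
    ])

# One 5-second sample at 16 kHz of synthetic noise
sample = np.random.default_rng(0).standard_normal(5 * 16000)
vec = rms_and_zcr_features(sample)
print(vec.shape)  # (6,)
```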
Note that none of these features involve transcription or semantic representation of dialogue/words in the audio environment. This way, the audio is translated into a machine-readable format that has little to no meaning to a human, as opposed to words, which are used in lexical analysis in natural language processing. This is an arguably less invasive and more privacy-sensitive approach than using words, especially if the robot intends to send the input vector to the cloud to be analyzed.

Classification Algorithms
In our experiments to determine whether our classification problem can be solved, we trained and tested models with six traditional machine learning classification algorithms, using the scikit-learn Python library [37]. These are commonly used algorithms for audio classification tasks (see Section 2 for more details). We performed an experimental evaluation of various approaches to see which classifiers would be best suited to the problem. We experimented with the following algorithms:
- KNeighborsClassifier [18]
- DecisionTreeClassifier [7]
- QDA (Quadratic Discriminant Analysis) [21]
- Logistic Regression [49]
- GaussianNB (Gaussian Naive Bayes) [48]
- SVC (C-Support Vector Classification) [8, 16]

We use these traditional classifiers instead of deep learning techniques, which have gained popularity in recent years in the audio analysis space, for multiple reasons. First, our dataset is modestly sized, and traditional ML algorithms have a much better chance of performing successfully than deep learning when the dataset is not very large. Second, we know the feature space that we want to use for this classification task. Last, we hope to use these trained classifiers on real-time systems, so the response time needs to be quick and the complexity and space taken by the classifier need to be reasonable (many social robots have limited compute power).
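A minimal sketch of how one of these classifiers (SVC) could be trained on pre-extracted feature vectors with scikit-learn. The synthetic 83-dimensional vectors here stand in for the real natural/media features, and the parameter values are illustrative rather than the tuned hyperparameters of Appendix B:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-ins for the 83-feature input vectors of each class
rng = np.random.default_rng(0)
X_natural = rng.normal(0.0, 1.0, size=(200, 83))
X_media = rng.normal(0.5, 1.0, size=(200, 83))
X = np.vstack([X_natural, X_media])
y = np.array([0] * 200 + [1] * 200)  # 0 = natural, 1 = media

# Standardize features (the normalization of Section 3.2),
# then fit the C-Support Vector Classifier.
clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))
```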
A grid search on each classification algorithm measured which hyperparameter combination was best for each algorithm on our first experiment (described in Section 4.1). The different hyperparameter combinations for each classifier that were experimented with can be found in Appendix A. The hyperparameters that led to the highest performance, and were subsequently selected for the classifier in all of the following tests, can be seen in Appendix B.
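The grid search could be sketched with scikit-learn's GridSearchCV as below; the parameter grid and data here are illustrative only, not the actual grids of Appendix A:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data for the 83-feature vectors
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 83))
y = rng.integers(0, 2, size=120)

# Exhaustively evaluate each hyperparameter combination with
# cross-validation and keep the best-scoring one.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]},
    cv=3,
    scoring="f1_macro",
)
grid.fit(X, y)
print(grid.best_params_)
```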

EXPERIMENTS AND RESULTS
In this section, we describe how the various classifiers performed on experiments that tested the classifiers' abilities to generalize to novel recordings, environments, and conditions. We test how well classifiers perform on a leave-one-recording-out cross-validation, where we test on recordings that were left out of the training set. We also test how well the classifiers generalize to classifying natural recordings from outside of the training corpus and to media recordings from (1) rooms, (2) speakers, (3) microphone positions, and (4) combinations of all three that they were not trained on.

Leave-one-recording-out Cross-validation
We performed an evaluation similar to a leave-one-out cross-validation (LOOCV), but in our case, leave-one-recording-out cross-validation (LOROCV). To perform LOROCV, we trained models using natural recordings from our C category and media recordings from our M recording set. For each fold of LOROCV, we trained on all recordings except for one from C and one from M. We did this for all possible pairs of recordings from C and M, which resulted in 600 folds (the Cartesian product of the 10 recordings in C and the 60 recordings in M). For each fold, we tested our classifier on the (1) left-out {C, M} recording pair, (2) left-out M recording and natural audio sampled from V, (3) left-out M recording and natural audio sampled from F, and (4) left-out M recording and natural audio sampled from both F and V. Because recordings can be of different lengths, we randomly sampled from the larger recording to match the size of the smaller recording. This ensured that we had balanced test sets each time.
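The fold enumeration can be sketched as follows (recording identifiers are illustrative): every pair of one held-out natural (C) recording and one held-out media (M) recording forms a fold, giving 10 x 60 = 600 folds.

```python
from itertools import product

c_recordings = [f"C{i}" for i in range(10)]   # 10 CHiME-5 sessions
m_recordings = [f"M{i}" for i in range(60)]   # 60 media recordings

folds = []
for held_c, held_m in product(c_recordings, m_recordings):
    # Train on everything except the held-out pair; test on the pair.
    train = [r for r in c_recordings + m_recordings
             if r not in (held_c, held_m)]
    folds.append({"train": train, "test": [held_c, held_m]})

print(len(folds))  # 600
```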
The metrics that we recorded for all of our experiments are below. TP is a true positive, TN is a true negative, FP is a false positive, and FN is a false negative.

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = (2 * Precision * Recall) / (Precision + Recall)

We recorded the precision, recall, and F1 scores for both the media and the natural classes (i.e., we treated each in turn as the positive class). Both the macro averages (arithmetic mean) and micro averages (weighted average) were recorded across all folds. The full results for LOROCV can be found in Table 13 in Appendix D, with a summary in Table 4 in Appendix C.
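The four metrics translate directly into code as functions of the confusion-matrix counts:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of positive predictions that were truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true positives that were found."""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example confusion counts (illustrative)
print(accuracy(40, 45, 5, 10))   # 0.85
print(round(f1(40, 5, 10), 3))   # 0.842
```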
With LOROCV, we test on natural audio from left-out CHiME-5 sessions (new voices and rooms from new homes within the CHiME-5 corpus), or better yet, on natural audio from the V or F categories that we recorded ourselves. We also test on unseen media recordings that the classifiers have not trained on and that we recorded ourselves. This provides insight into how the trained algorithms can generalize to classifying novel recordings of media and natural audio.

Leave Out Rooms, Speakers, and Microphone Positions in the Media Set
We can gain further insight into how robustly the classifiers can differentiate between natural and media audio if media in the training set contains recordings from different acoustic conditions (e.g., rooms, loudspeakers, microphone distances) than media in the testing set. In the experiments in this section, we evaluate how our classifiers perform when toggling which condition(s) of the media recording set to leave out of the training set. We also use the natural audio from the C category to train our models. We test on the natural V and F categories that we recorded ourselves and on the left-out media.
We left all of the media samples of a specific (1) room, (2) speaker, (3) microphone position, or (4) combination of the three out of the training set and tested on the left-out media samples and on natural samples from the V and F test categories. We matched the number of media samples in the training set with an equally distributed, random sample of 5-second samples from each natural recording in category C. We randomly sampled from all of the recordings in the larger test subset to match the size of the smaller subset. This ensured that we had balanced test sets each time. We recorded the micro and macro averages of precision, recall, and F1 scores for both the media and natural classes, as in LOROCV. The following paragraphs describe each experiment that we performed:

In Leave One Label Out (LOLO), we wanted to see how well classifiers would perform when they trained on media from specific labels, or specific room, speaker, and Kinect distance configurations (see Table 1), along with natural audio from category C, and then were tested against configurations that they were not trained on. We performed a LOLO experiment on all labels of our media data, where we trained different models using all the recordings from all combinations of labels and tested against the held-out labels. The left-out media data at each fold was tested along with natural audio from the V, F, and V+F datasets. The full results for each classifier can be found in Table 14 of Appendix D, with a summary in Table 5 of Appendix C.
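Splits of this leave-one-group-out family can be generated with scikit-learn's LeaveOneGroupOut, treating each recording configuration (room, speaker, distance, or label) as a group; the data and group assignments here are synthetic stand-ins:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic feature vectors and three recording-configuration labels
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 83))
groups = np.repeat(np.arange(3), 30)  # e.g., three (room, speaker, distance) labels

# Each split holds out every sample from exactly one group,
# so the test configuration never appears in the training set.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, groups=groups):
    held_out = set(groups[test_idx])
    assert held_out.isdisjoint(set(groups[train_idx]))
print(logo.get_n_splits(groups=groups))  # 3
```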
In Leave One Room Out (LORO), we wanted to see how well classifiers would perform when they trained on media from specific rooms, along with natural audio from category C, and then were tested against media from a room they had not trained on. This is important because each room has a different acoustic environment and layout. The classifiers should be able to make accurate predictions regardless of whether they have trained on audio from the room in which they are deployed. In LORO, classifiers test on media recordings from a room that they have not trained on, but the test set includes loudspeakers and microphone distances that they have trained on. The left-out media data at each fold was tested along with natural audio from the V, F, and V+F datasets. The full results for each classifier can be found in Table 15 of Appendix D, with a summary in Table 6 of Appendix C.
In Leave One Speaker Out (LOSO), we wanted to see how well classifiers would perform when they trained on media from specific loudspeakers, along with natural from category C, and then were tested against media from loudspeakers they had not trained on.This is important, because each loudspeaker has different hardware properties, and the classifiers should be able to make accurate predictions regardless of if they have trained on audio from the loudspeaker from which they hear audio.In LOSO, classifiers test on media recordings from a loudspeaker that they have not trained on, but the test set includes rooms and microphone distances that they have trained on.The left out media data at each fold was tested along with natural audio from the category V, F, and V+F datasets.The full results for each classifier can be found in Table 16 of Appendix D, with a summary in Table 7 of Appendix C.
In Leave One Distance Out (LODO), we wanted to see how well classifiers would perform when they trained on media from certain microphone distances from a loudspeaker, along with natural from category C, and then were tested against media from microphone distances they had not trained on.This is important, because the robot might be at variable distances from the sound source.In LODO, classifiers test on media recordings from a microphone distance that they have not trained on, but the test set includes loudspeakers and rooms that they have trained on.The left out media data at each fold was tested along with natural audio from the category V, F, and V+F datasets.The full results for each classifier can be found in Table 17 of Appendix D, with a summary in Table 8 of Appendix C.
In Leave One Room and Speaker Out (LORSO), we wanted to see how well classifiers would perform when they were tested on media rooms and speakers that they had not trained on.This is a more robust test than the previous ones.In LORSO, classifiers test on media recordings from a room and speaker that they have not trained on, but the test set includes microphone distances that they have trained on.The left out media data at each fold was tested along with natural audio from the category V, F, and V+F datasets.The full results for each classifier can be found in Table 18 of Appendix D, with a summary in Table 9 of Appendix C.
In Leave One Room and Distance Out (LORDO), we wanted to see how well classifiers would perform when they were tested on media rooms and microphone distances that they had not trained on. In LORDO, classifiers test on media recordings from a room and microphone distances that they have not trained on, but the test set includes loudspeakers that they have trained on. The left-out media data at each fold was tested along with natural audio from the category V, F, and V+F datasets. The full results for each classifier can be found in Table 19 of Appendix D, with a summary in Table 10 of Appendix C.
In Leave One Speaker and Distance Out (LOSDO), we wanted to see how well classifiers would perform when they were tested on media speakers and microphone distances that they had not trained on. In LOSDO, classifiers test on media recordings from a loudspeaker and microphone distances that they have not trained on, but the test set includes rooms that they have trained on. The left-out media data at each fold was tested along with natural audio from the category V, F, and V+F datasets. The full results for each classifier can be found in Table 20 of Appendix D, with a summary in Table 11 of Appendix C.
In Leave One Room, Speaker, and Distance Out (LORSDO), we wanted to see how well classifiers would perform when they were tested on media speakers, rooms, and microphone distances that they had not trained on. This is the most challenging test that we perform for the classifier. In LORSDO, classifiers test on media recordings from a room, loudspeaker, and microphone distance that they have not trained on. The left-out media data at each fold was tested along with natural audio from the category V, F, and V+F datasets. The full results for each classifier can be found in Table 21 of Appendix D, with a summary in Table 12 of Appendix C.
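Each of the leave-N-out protocols above follows the same loop: hold out every media sample matching a configuration, subsample the larger test subset so the media and natural test sets are balanced, train, and score. A minimal sketch of one fold, assuming precomputed feature matrices and per-sample configuration labels (all names here are hypothetical, and an untuned SVC stands in for whichever classifier is being evaluated):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def leave_config_out(X_media, cfg_media, X_natural_train, X_natural_test, held_out_cfg):
    """One fold: hold out all media samples matching held_out_cfg, train on the rest."""
    held = np.array([c == held_out_cfg for c in cfg_media])
    X_tr_media, X_te_media = X_media[~held], X_media[held]

    # Balance: randomly subsample the larger test subset to match the smaller one.
    n = min(len(X_te_media), len(X_natural_test))
    X_te_media = X_te_media[rng.choice(len(X_te_media), n, replace=False)]
    X_te_nat = X_natural_test[rng.choice(len(X_natural_test), n, replace=False)]

    # Label media as 1 and natural as 0.
    X_train = np.vstack([X_tr_media, X_natural_train])
    y_train = np.concatenate([np.ones(len(X_tr_media)), np.zeros(len(X_natural_train))])
    X_test = np.vstack([X_te_media, X_te_nat])
    y_test = np.concatenate([np.ones(n), np.zeros(n)])

    # Standardize on the training fold only, then fit and score.
    scaler = StandardScaler().fit(X_train)
    clf = SVC().fit(scaler.transform(X_train), y_train)
    y_pred = clf.predict(scaler.transform(X_test))
    return f1_score(y_test, y_pred, average="macro")
```

Running this once per held-out configuration and averaging the returned scores reproduces the shape of the experiments above.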

Selecting a Classifier
In general, most of the trained classification algorithms perform well on our experiments, with average F1 scores in the 80s or 90s for a majority of the tests. Table 2 summarizes the results of all our experiments for each classifier.

Results
We see that SVC has the best performance on the most tests throughout our experiments. SVC has the highest average F1 score on 12 out of the 27 tests, including the highest average F1 score on 7 out of the 12 more difficult tests (where two or three of the media parameters are left out of the test set, in LORSO, LORDO, LOSDO, and LORSDO). SVC has the highest performance on the F+V+M test sets on all but one of the more difficult experiments, and SVC has the highest F1 score on the F+M test sets for almost all of the experiments. On LORSDO, the most difficult experiment, SVC has the best performance on two out of three of the tests (V+M and F+M). Despite not always having the highest scores on V+M, it does consistently well on that test set throughout all of the experiments. Generally, SVC is the most consistent classifier across the different test sets and experiments and consistently achieves high F1 scores.
The next-best classifier in terms of leading F1 scores is QDA, which has seven of the best F1 scores. For QDA, all of these top results come in the first five experiments, where the training data includes more of the acoustic environment and conditions than in the last four experiments. QDA performs very strongly on the V+M test sets and on the F+V+M test sets for these experiments. This shows that if the training set has certain qualities similar to the test set, then QDA could be a legitimate option for classifying between natural and media. However, the classifier that performs best when the test data is most dissimilar to the training data is SVC. QDA does reasonably well but performs worse overall than SVC in the last four experiments, especially on the F+M and the F+V+M datasets. QDA could be a good option alongside SVC if we know that the testing environment and conditions will have similarities to the training set.

DT has the top average F1 scores on four of the tests. DT does very well when classifying the V+M test set, with high scores on three out of four of the V+M tests in the more difficult experiments. Except for LOROCV, DT performs very well on the V+M test sets in all of the experiments. However, there is a significant tradeoff in how well DT performs on the F+M test sets. DT might be very good at classifying between natural and media when natural video calls and media are in the test set, but it does very poorly at classifying natural family conversations. In this regard, SVC is better overall for its consistency across both the V+M and F+M test sets.
LR has the top average F1 score on only two of the tests; however, we see that LR is able to generalize well to new media and natural audio. LR performs very well in many of the experiments, with F1 scores that are close to, albeit slightly worse than, SVC in most of the experiments. Especially in LORSO, LORDO, LOSDO, and LORSDO, we see that LR is able to perform consistently well on V and F data, with scores similar to those of SVC on the F+V+M datasets. LR generalizes well to environments it has not trained on, both for left-out media data and for video calls and family conversations. However, QDA is better than LR when the training set is more similar to the test set, and SVC is better than LR when the test set is more dissimilar.
KNN and GNB have the worst performances in our experiments. KNN performs the best on V+M in LOROCV, but besides that, KNN and GNB show substantially worse performance than the other classifiers. They perform particularly poorly on LORSDO, which tests how well they can generalize when the training media data is very dissimilar to the test set. We would not recommend KNN or GNB, especially when compared to our other trained classifiers.

Discussion
Overall, SVC is best able to generalize to new recordings. We see this both in SVC's ability to perform well on natural data that we recorded in our own homes, which was outside of the natural audio from the CHiME-5 corpus that the model was trained on, and in its good performance on media from loudspeakers, microphone distances, and rooms that it was not trained on (Table 21). SVC performs consistently well when tested on in-the-home, natural audio of both V and F. SVC performs with accuracies of over 85% on LORSDO, with recall scores of over 90% for natural V or F audio and recall of over 81% for media data from a different room, loudspeaker, and microphone distance than it was trained on. We believe that SVC is the best classification algorithm that we experimented with at disambiguating between natural and media. It performs the most consistently across our test sets in our experiments and does the best at generalizing to new environments and conditions that it has not trained on.
LR also performs well on both of the natural test sets and on many of the experiments but performs worse than SVC overall. QDA performs very well when tested against data with some similar characteristics to what it is trained with but does more poorly on stricter generalizability tests. DT performs very well on video calls but very poorly when tested against family conversations. KNN and GNB do not perform well.
Since QDA, LR, and SVC all perform well across all of our test sets and experiments, with QDA showing particularly strong performance when the media testing conditions have some similarities to their training conditions, it could be an option to use an ensemble of classifiers in making the natural vs. media prediction. We need to verify that the classifiers do not take too long to make predictions and that they do not take too much space in memory. If these two statements hold true, then it could be reasonable to use all three in predicting natural vs. media. We perform these timing and size experiments in Section 5.1.
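Should an ensemble prove worthwhile, the three classifiers could cast a hard majority vote on each 5-second sample. A minimal sketch using scikit-learn's VotingClassifier (default hyperparameters are shown for illustration only; they are not the settings tuned in this article):

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hard majority vote over the three best-performing classifiers,
# with the same standardization step applied before each of them.
ensemble = make_pipeline(
    StandardScaler(),
    VotingClassifier(
        estimators=[
            ("svc", SVC()),
            ("qda", QuadraticDiscriminantAnalysis()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="hard",  # each classifier casts one natural-vs-media vote
    ),
)
# Usage: ensemble.fit(X_train, y_train); ensemble.predict(features_of_5s_sample)
```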

PROPOSED APPLICATION
A critical criterion when selecting a classification algorithm is that it can perform in close to real time to be suitable for a robot in the home or in the real world. A robot should provide a naturalistic and intuitive interaction for human users, so real-time classifications and responses are essential.
Taking too much time to analyze the audio environment, extract features, make predictions, and act on those predictions may negatively affect the overall interaction. Keeping these factors in mind, we (a) perform several timing and size tests on various steps of the audio collection and decision-making process, (b) suggest an overall classification pipeline for a robot to implement this approach, and (c) present ethics and privacy considerations that were taken into account for this pipeline. For these timing and size experiments, we train the classifier on the entire natural C category that we compiled and all of our media recordings.

Timing and Size Experiments
We measured the speed of feature extraction and prediction using around 45 minutes of audio data (540 5-second samples). Extracting features from each of the 540 audio samples took an average of 0.557 seconds (STD = 0.0442 seconds) on a Dell laptop with an Intel i5-5200U CPU @ 2.2 GHz and 8 GB RAM. To measure the average prediction time for each audio sample, we measured the time that it took to standardize and predict the entire (540×83) input vector and divided it by 540. The trained standardization scaler had a size of 4 kB. The average prediction times and the sizes on disk for each trained classifier can be seen in Table 3.
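These measurements can be reproduced with a simple wall-clock harness such as the sketch below; the stand-in feature extractor is a placeholder, not the 83-feature extraction used in this article:

```python
import time

import numpy as np

def time_per_sample(fn, inputs):
    """Average wall-clock seconds of fn over a batch of inputs."""
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    return (time.perf_counter() - start) / len(inputs)

# Example: time a stand-in "feature extractor" over 540 silent 5-second clips
# (16 kHz mono -> 80,000 samples each; sample rate is an illustrative assumption).
clips = [np.zeros(80_000, dtype=np.float32) for _ in range(540)]
extract = lambda clip: np.array([clip.mean(), clip.std()])  # placeholder features
avg_s = time_per_sample(extract, clips)
print(f"avg per sample: {avg_s:.6f} s")
```

The same harness, with the classifier's predict call substituted for `extract`, yields the per-sample prediction times.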
We see that all of the classifiers that we trained have fast prediction times. DT and GNB are the fastest, with LR and QDA next, then SVC, and KNN last. However, all of the classifiers except for KNN predict in considerably less than a millisecond, so we believe that any of them would be sufficient in that respect.
With respect to size on disk, LR and GNB are the smallest, with DT next smallest. SVC is the second largest but still not prohibitively large.
These sizes (and prediction times) are also promising in that, if the dataset were to grow substantially larger, most of these classification algorithms seem like they would be able to scale and still be reasonable to use on-board and in real time. This might not be true for KNN, but KNN was already eliminated due to its poor generalization performance.
This also means that the whole classification process could run on-board a robot, even one with little memory. After recording a 5-second sample, feature extraction, standardization, and prediction together can take less than a second, making it possible to use this in real time.
Furthermore, a robot could reasonably store multiple trained classifiers on disk in less than one megabyte (MB) of space. If using an ensemble of classifiers, the prediction time still remains substantially lower than one millisecond. Together, the timing and size of the classifiers allow for an ensemble to be used.

Classification Pipeline
In a real-world setting, we suggest our classifier be used as part of a greater classification pipeline, shown in Figure 1. A Kinect One microphone would be required, along with minimal on-board computing power. All audio collection, analysis, and computation can take place locally, without needing to offload any data to online services.
The system begins by recording a 5-second raw audio stream of the environment and initializing the count variables to 0. The system stores the recording and checks it for speech.

If speech is not detected, then the system loops back to the start, resetting the counts and deleting the recording. If speech is detected, then feature extraction is performed, the audio is deleted, and a corresponding natural or media prediction is made. After a prediction, the corresponding count is incremented, and the other count is reset to 0. Only after X (for media) or Y (for natural) consecutive predictions in a certain category will the decision be "final." Otherwise, the corresponding count is reset to 0. Once a final decision is output by the pipeline, the process starts again, with both counts initialized to 0. Depending on how sensitive we want the system to be to the classifier's predictions, we can alter the values of X and Y. For example, with X = 3, the classifier will have to predict close to 15 consecutive seconds (three decisions in a row) as media. This approach does not allow one false positive to ruin the final classification; rather, the classifier would have to get the audio scene wrong three times in a row to make a mistake.
An alternative approach is to set both X = 1 and Y = 1, in which case the pipeline will return a final prediction on every 5-second audio sample unless it does not detect speech. This gives a robot using this pipeline more frequent data points to use in its final decision-making.
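The counting logic described above can be sketched as a small loop; `detect_speech`, `extract_features`, and `predict_media` are hypothetical stand-ins for the pipeline components, not interfaces from this article:

```python
def classify_stream(samples, predict_media, detect_speech, extract_features, X=3, Y=3):
    """Yield a "final" decision after X (media) or Y (natural) consecutive
    agreeing predictions; any disagreement or silence resets the counts."""
    media_count = natural_count = 0
    for sample in samples:                    # each sample is one 5-second recording
        if not detect_speech(sample):
            media_count = natural_count = 0   # no speech: reset counts, discard audio
            continue
        feats = extract_features(sample)      # raw audio can be deleted after this
        if predict_media(feats):
            media_count, natural_count = media_count + 1, 0
            if media_count == X:
                yield "media"
                media_count = 0               # final decision made; start over
        else:
            natural_count, media_count = natural_count + 1, 0
            if natural_count == Y:
                yield "natural"
                natural_count = 0
```

With X = Y = 1 this degenerates to the alternative approach above, emitting a decision for every speech-bearing sample.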
After the system determines whether the speech that it hears in its environment is media or natural, it can use this classification, along with other contextual information, to make decisions on how to act. For example, the robot could also have other tools available to it that can detect characteristics from human speech, such as tone, emotion, and intensity. The robot could also utilize context such as the time of day, the day of the week, its location in the home, the current weather, and more.
Another interesting contextual tool that could be incorporated into this pipeline is sound source localization (SSL), which utilizes the microphone array of the Kinect. SSL could help the robot approximate where the speech is coming from. This extra context, combined with the natural vs. media classification, could further assist the robot in making a more informed decision on social presence and provide it with a better understanding of its environment. Voice activity detection (VAD) and SSL could be combined to localize and individually classify multiple speakers in a noisy audio scene, but such VAD for multi-speaker diarization in real-world scenarios remains an open research problem [32]. This classification pipeline can provide the robot with an understanding of whether speech in its environment is natural or media, helping it infer social presence. The robot can use this information, along with other context, to make appropriate decisions about how to interact, or not, to best accommodate its user(s) and to reach its goals.

Ethics and Privacy Considerations
In-home data is inherently sensitive, and the audio pipeline presented in our article is designed with that in mind. We believe our solution is minimally invasive. Using one modality (i.e., just audio) to make decisions is undoubtedly less invasive than using more. In fact, our suggested solution is computed locally (it is lightweight and does not require sending any sensitive data to online services), only needs to store a 5-second sample of audio at a time (which can be deleted immediately after features are extracted from it), and does not use any semantic representation or transcription of the audio (which could contain sensitive information) as part of its decision-making. These are important factors that keep users' privacy in mind.

LIMITATIONS
There are several limitations to this work that we believe are important to make clear. First, the dataset that we compiled could be more diverse and representative. Our natural training data is composed only of audio from the CHiME-5 dataset, even though it does contain audio from different homes, rooms, and voices. Our media dataset contains three different rooms from within one home and five different electronic devices. Obviously, there are countless other possible devices from which audio can be emitted in the home that were not included in our training set. Despite these limitations, our results showed that classifiers were able to make accurate media classifications on audio from recording devices, rooms, microphone distances, and combinations of the three that they were not trained on, and the classifiers were able to classify natural audio from outside of the CHiME-5 training corpus, which included new rooms and voices in the V and F test sets. Another limitation is that the recordings in our V and F categories could be more diverse and comprehensive, with the inclusion of audio from more homes, families, and people. Also, we only focus on audio from the home, when, ideally, such a classification tool should be able to make predictions in other dynamic, human environments as well.
Additionally, our dataset does not include examples of scenarios where media from television or radio shows is playing at the same time that natural conversation (which includes at least one co-located person) is occurring. Further testing would be needed to see how our classifiers would perform when both media and natural audio are overlaid. We did see that in situations where electronic and organic speakers are conversing with each other in the audio scene (in our video calls test category), the classification algorithms classified the audio as natural. It could be beneficial if a robot could garner more detailed context by identifying, indexing, and classifying between each organic and electronic speaker engaged in the conversation, but we leave this as a future research direction. Regardless, through our experimentation in this article, we see that the classifiers can provide important context to a robot by accurately differentiating between common speech scenarios in the home from which social presence can be implied: popular genres of media originating from loudspeakers and natural conversation including a co-located user.

CONCLUSIONS
Detecting social presence using sound involves being able to classify audio as containing either (1) natural conversation including at least one co-located user or (2) media playing from electronic sources that does not require a social response, such as television shows. It is important for in-home social robots to have such a capability, as the additional context can help them in their decision-making. We perform an experimental evaluation that tests the robustness of several traditional machine learning classifiers on data from our compiled natural vs. media dataset. We conclude that an SVC algorithm outperforms other classifiers, and we propose a classification pipeline that can be utilized by social robots in the home to help them detect social presence using sound.
We present the average F1 scores between each of the two classes across all 14 LOLO folds. For each fold, a media recording and a natural C recording were held out of the training set and used in the testing set along with the natural audio from our own homes. We present the average F1 scores between each of the two classes across all three LORO folds. For each fold, all of a room's media recordings were held out of the training set and used in the testing set along with the natural audio from our own homes. We present the average F1 scores between each of the two classes across all five LOSO folds. For each fold, all of a loudspeaker's media recordings were held out of the training set and used in the testing set along with the natural audio from our own homes. We present the average F1 scores between each of the two classes across all three LODO folds. For each fold, all of a microphone distance's media recordings were held out of the training set and used in the testing set along with the natural audio from our own homes. We present the average F1 scores between each of the two classes across all nine LORSO folds. For each fold, all of a room and loudspeaker combination's media recordings were held out of the training set and used in the testing set along with the natural audio from our own homes. We present the average F1 scores between each of the two classes across all nine LORDO folds. For each fold, all of a room and microphone distance combination's media recordings were held out of the training set and used in the testing set along with the natural audio from our own homes.
We present the average F1 scores between each of the two classes across all nine LOSDO folds. For each fold, all of a speaker and microphone distance combination's media recordings were held out of the training set and used in the testing set along with the natural audio from our own homes. We present the average F1 scores between each of the two classes across all 14 LORSDO folds. For each fold, all of a room, speaker, and microphone distance combination's media recordings were held out of the training set and used in the testing set along with the natural audio from our own homes. The table presents the results of the three LORO folds (each room column is the left-out room) and the macro averages across all LORO folds for each classifier. Only macro averages are presented, because the test sets were the same size (the left-out room media set was larger than the natural testing subset, so media was sampled to match the size of the natural sets).
The table presents the results of the five LOSO folds (each speaker column is the left-out speaker) and the macro (M) and micro (μ) averages across all LOSO folds for each classifier. The table presents the results of the three LODO folds (each microphone distance column is the left-out distance) and the macro averages across all LODO folds for each classifier. Only macro averages are presented, because the test sets were the same size (the left-out microphone distance media set was larger than the natural testing subset, so media was sampled to match the size of the natural sets).

Table 2 .
Experiment Summary

The table shows the average of the macro-average F1 scores ((F_natural + F_media)/2) for each classifier across all folds of each experiment. The table shows the average results of the trained classifiers being tested on the left-out media sets along with natural recordings from the V and F categories. The classifier with the best average performance on each test set and experiment is in bold. More comprehensive results can be found in the Appendix.

Table 3 .
Classifier Size and Prediction Times

Table 12 .
Leave-One-Room+Speaker+Distance-Out (LORSDO) Summary

Table 13 .
Leave-One-Recording-Out CV Results

The table presents the macro and micro averages across all LOROCV folds for each classifier.

ACM Transactions on Human-Robot Interaction, Vol. 12, No. 4, Article 47. Publication date: December 2023.

Table 14 .
Leave-One-Label-Out Results

Table 15 .
Leave-One-Room-Out Results

Table 16 .
Leave-One-Speaker-Out Results

Table 17 .
Leave-One-Distance-Out Results

Table 18 .
Leave-One-Room+Speaker-Out Results

The table presents the macro and micro averages across all LORSO folds for each classifier.

Table 19 .
Leave-One-Room+Distance-Out Results

The table presents the macro and micro averages across all LORDO folds for each classifier.

Table 20 .
Leave-One-Speaker+Distance-Out Results

The table presents the macro and micro averages across all LOSDO folds for each classifier.

Table 21 .
Leave-One-Room+Speaker+Distance-Out Results

The table presents the macro and micro averages across all LORSDO folds for each classifier.