Comprehensively Auditing the TikTok Mobile App

TikTok has become a dominant force in the social media landscape of the United States and has led other social media sites to emulate its algorithmically driven short-form content recommendation platform (e.g., YouTube Shorts and Instagram Reels). The short-form vertical content is designed to be consumed on mobile phones, but the limited number of existing audits have predominantly investigated TikTok through the web application. Additionally, there are no advertisements on the web version of TikTok, and as such the advertising ecosystem of the platform has thus far gone largely unstudied. In this work we propose a technique for auditing TikTok's recommendation algorithm by interfacing with emulators and intercepting network traffic. In this way we are able to measure the personalization that comes from user-specified demographics such as gender and age and better understand how ads are delivered to these groups. Future work will investigate personalization from user interactions, such as liking posts and following creators based on the user's interests, and will study the role that algorithmic personalization plays in ad targeting.


INTRODUCTION
TikTok, developed by ByteDance, is the fastest growing social media platform [2] and boasts 102 million users in the US [5]. Despite TikTok's dominant presence in the social media sphere and its influence on public attitudes, relatively little research has investigated the platform. In this work, we propose novel techniques to audit TikTok's social media platform.
This research provides two major contributions. First, we contribute to the auditing landscape in general by providing a novel technique for auditing mobile apps, a frontier of auditing that has thus far gone largely unexplored. By controlling multiple Android devices simultaneously, researchers using this technique can run controlled experiments on how hypothetical users interact with an app in parallel. Additionally, by intercepting network traffic, researchers can understand what data the app sends to and receives from its developers, with potential security and fairness implications.
Second, we contribute to the landscape of research on TikTok in particular. We perform experiments measuring the variance in video and ad recommendations across user-declared genders and across user-declared ages. While we find little evidence that gender impacts the video or ad delivery algorithms, we see strong support for age impacting the delivery of both videos and ads. We see that 13-year-olds in particular are delivered videos and ads that are different from those delivered to adults.
Taken together, our work provides a framework for auditing the TikTok mobile app and a preliminary study demonstrating the viability of this approach and the kinds of research that it enables. Our future work seeks to understand the extent of algorithmic video recommendation and advertisement personalization based on inferred user interests.

PROBLEM
Given the explosive popularity of TikTok and the comparative lack of research on the platform, we aim to provide techniques for auditing the TikTok mobile app. Using these techniques, we answer key questions about the role that demographics play in the content and advertisements a user sees.
Prior work [4] has studied the platform using a sock-puppet-based methodology applied to the TikTok website. While auditing through the website makes many aspects of the process much more straightforward, there are three problems with this approach. First, the vast majority of users do not use TikTok on the web; the short-form videos on the platform are all vertical and thus designed for mobile devices. Second, the investigation scrapes only between 90 and 150 videos, potentially not enough to see personalization take effect. Finally, the website does not deliver ads to the user, which makes it impossible to measure the personalization of advertisements using existing sock-puppet approaches. To our knowledge, there has been no comprehensive audit of the TikTok mobile app; while the Wall Street Journal released an investigation of the personalization algorithm on the mobile app [1], few details were provided about its methodology and it lacked the scientific rigor of a published study.

RELATED WORK
By creating a series of accounts and interacting with the website version of TikTok, Boeker and Urman [4] were able to measure how the recommendation algorithm reacts to certain user inputs such as liking, following, and watching a video for an extended period. They also measured how user language preferences and location affect the recommendations from the algorithm. They used the Jaccard Index to measure similarity between feeds, comparing the posts, hashtags, creators, and sounds that appeared in each. They found that who a user follows has the strongest influence on the recommendation algorithm, followed by how long a video is watched and what videos a user likes.
This work offers a thorough initial evaluation of TikTok, but there are several opportunities for follow-up research to provide a more comprehensive understanding of the TikTok recommendation algorithm. As touched upon in Section 2, this work audits the TikTok website, which is not where the vast majority of users spend their time on the app. This work also scrapes between three and five "batches" of 30 videos each, so each experiment covers only 90 to 150 videos within a single session. It's possible that this quantity of videos doesn't give personalization enough time to take effect, which could explain their finding that the feeds of users engaging with the platform through liking or following don't see a reduction over time in the popularity of the posts that are delivered. It's possible that a user who continues to engage with certain content will, over time, see more niche content even if they initially still see mostly very popular content. Their work also does not contain any analysis of the advertising platform, since there are no ads on the website version of TikTok. We are particularly interested in the extent to which ad personalization happens on TikTok, a question that can only be answered through techniques that audit the mobile app.
In 2021 the Wall Street Journal released a video discussing their investigation into the role that user interests play in content recommendations [1]. They found that the algorithm pushed some bots into rabbit holes of sadness/depression content and fringe political conspiracies based on watch time alone. While this represents an important journalistic investigation, few details are available about the technical implementation of their approach. They briefly mention analyzing the advertisements they were delivered, so it is possible they were looking at the mobile app, but there isn't an accompanying technical document. At the moment, while their investigation offers evidence of problematic rabbit holes as a result of the TikTok recommendation algorithm, it's difficult to replicate their findings.

PROPOSED APPROACH
We propose a technique for auditing the TikTok mobile app through emulating a series of Android devices, installing a version of TikTok without certificate pinning, and intercepting the network traffic of the Android device to analyze the metadata of the videos. Through this technique, we are able to answer novel questions such as the role that certain demographic information plays in TikTok's recommendation system, as well as study the ad delivery ecosystem on the platform.

METHODOLOGY
5.1 Technique to Audit the TikTok Mobile App
To audit the TikTok app, we need some way of running TikTok on devices that allow us to log in and watch videos. To do this, we employ Android device emulators running through Android Studio. This allows us to simulate many devices at once and control them from our computer. We use the Android Studio CLI [9] to launch the devices, and Android Debug Bridge (adb) [10] to connect to our devices and install the TikTok app. We also use UIAutomator2 [8] to interface with the devices, allowing us to swipe, click, and otherwise interact with each device programmatically.
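The device setup described above can be sketched roughly as follows. This is a minimal illustration rather than the exact tooling used in this work: the AVD names, the APK path, the swipe coordinates, and the pause length are all hypothetical, and the helper names are ours. It assumes the `emulator` and `adb` binaries are on the PATH and that the third-party `uiautomator2` Python package is installed on the host.

```python
import subprocess
import time


def emulator_command(avd_name, console_port):
    """Build the command line that boots one emulator on a fixed console port."""
    return ["emulator", "-avd", avd_name, "-port", str(console_port)]


def serial_for(console_port):
    """adb names an emulator after its console port, e.g. emulator-5554."""
    return f"emulator-{console_port}"


def adb_install_command(serial, apk_path):
    """Build the adb command that installs an APK on one specific device."""
    return ["adb", "-s", serial, "install", apk_path]


def launch_device(avd_name, console_port):
    """Start the emulator in the background and return its process handle."""
    return subprocess.Popen(emulator_command(avd_name, console_port))


def scroll_feed(serial, swipes, pause_s=5.0):
    """Swipe up through the feed, pausing on each video (pause is an assumption)."""
    import uiautomator2 as u2  # third-party; imported lazily so the sketch loads without it

    device = u2.connect(serial)
    width, height = device.window_size()
    for _ in range(swipes):
        # Drag from the lower part of the screen to the upper part to advance one video.
        device.swipe(width // 2, int(height * 0.8), width // 2, int(height * 0.2), 0.2)
        time.sleep(pause_s)
```

Keeping the command construction separate from execution makes it easy to check the arguments for each of the devices before anything is launched.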
If we were only interested in the content on the device's screen, we would be able to create an emulator, download the TikTok app from the Google Play store, and control the interactions with the account that way. However, there is a lot of metadata that isn't made available to the user that is nonetheless important for our analysis. For example, TikTok does not reveal the video ID of a post on TikTok; we use the video ID in Section 5.2 as a unique identifier of the videos delivered to a given user, allowing us to measure similarities between two TikTok feeds. Additionally, there are some other elements of the metadata that may prove useful or interesting to analyze, such as the play count of the video or the "Suggested Words", a list of related words or phrases separate from the hashtags and description that TikTok generates for many videos. As such, we need some way to collect all metadata of a given video, not just the surface-level metadata that appears on the screen (e.g., like counts, comment counts, and author names). To accomplish this, we need to intercept the network traffic of the TikTok app for each device. Once we do that, we can parse the network traffic for the video IDs and use those video IDs to gather the metadata of the video from the HTML of the video's TikTok webpage.
Intercepting the network traffic of TikTok involves overcoming several challenges. First, TikTok implements SSL pinning to prevent the end user from intercepting and reading app data, which is exactly what we are seeking to do. To that end we need to unpin the TikTok SSL certificate, which we can do by patching a TikTok APK [7]. Once we have the ability to intercept the network traffic, we need to set up a man-in-the-middle proxy to collect the traffic for later analysis. To do this, we use Mitmproxy to intercept the web traffic of each device [3]. Mitmproxy allows us to intercept the network traffic for each of our devices, provided they're running on different HTTP proxies. We can configure the devices to run on an HTTP proxy through the Android Studio CLI, and we can run a separate Mitmproxy instance to listen on each of the HTTP proxies of our devices. Once we are listening to the network traffic of each device, we can save that traffic for later processing.
Our final challenge is that much of the information we're interested in is only available in the metadata field of the video's TikTok webpage. We therefore need to find the URL of the video so that we can scrape the metadata hidden in the HTML of the corresponding TikTok page. To get around this challenge, we parse the network data to find the URL of the video, including the video ID. We can then use this URL to request the corresponding page from the TikTok website, which gives us the HTML data. In that HTML data is a script with a field that contains the metadata we're interested in. Finally, we can use the data from this field to process the metadata of each video that the TikTok bot sees.
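One way to pair each emulator with its own proxy is sketched below. It is an illustration under stated assumptions, not the configuration used in this work: the port numbers, AVD names, and output file names are hypothetical. It relies on the emulator's `-http-proxy` option and on `mitmdump` (the non-interactive Mitmproxy front end) with its `--listen-port` and `-w` options. Installing the Mitmproxy CA certificate on each device is also required in practice and is not shown.

```python
def proxy_plan(avd_names, first_console_port=5554, first_proxy_port=8080):
    """Assign every device its own emulator console port and its own proxy port."""
    plan = []
    for index, avd in enumerate(avd_names):
        console_port = first_console_port + 2 * index  # emulator console ports are even
        proxy_port = first_proxy_port + index          # one proxy instance per device
        plan.append({
            "avd": avd,
            # Proxy instance that records this device's traffic to its own file.
            "mitmdump": ["mitmdump", "--listen-port", str(proxy_port),
                         "-w", f"{avd}.flows"],
            # Emulator instance whose HTTP traffic is routed through that proxy.
            "emulator": ["emulator", "-avd", avd, "-port", str(console_port),
                         "-http-proxy", f"http://127.0.0.1:{proxy_port}"],
        })
    return plan


if __name__ == "__main__":
    for entry in proxy_plan(["audit_0", "audit_1"]):
        print(" ".join(entry["mitmdump"]))
        print(" ".join(entry["emulator"]))
```

Writing each device's traffic to a separate file keeps the feeds apart, so later similarity comparisons never mix videos from two accounts.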
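The parsing steps above might look roughly like the following sketch. Two things here are assumptions rather than details given in this paper: the shape of the video URL matched by the regular expression, and the `id` attribute of the script element that holds the metadata, which is left as a caller-supplied parameter. Fetching the page itself is omitted; the functions operate on text that has already been saved.

```python
import json
import re
from html.parser import HTMLParser

# Assumed URL shape for a video page; adjust to whatever the captured traffic contains.
VIDEO_URL_RE = re.compile(r"https://www\.tiktok\.com/@[\w.]+/video/(\d+)")


def video_ids(traffic_text):
    """Return the distinct video IDs found in captured traffic, in first-seen order."""
    ids = (match.group(1) for match in VIDEO_URL_RE.finditer(traffic_text))
    return list(dict.fromkeys(ids))


class ScriptJSONExtractor(HTMLParser):
    """Collect the text inside the <script> element that has a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def embedded_metadata(html, script_id):
    """Parse the JSON embedded in the chosen script element, or return None if absent."""
    parser = ScriptJSONExtractor(script_id)
    parser.feed(html)
    parser.close()
    return json.loads("".join(parser.chunks)) if parser.chunks else None
```

The resulting dictionary can then be searched for fields such as the play count or the suggested words, whose exact keys depend on the page.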

5.2 Investigating Role of Demographics in Personalization
Using the techniques discussed in Section 5.1, we perform a preliminary audit to understand the role that the user-declared demographics of gender and age play in the delivery of content on TikTok. While users are required to enter their age when creating a TikTok account, gender is an optional demographic that can be entered on the "How your ads are personalized" page in the Ads section of the settings menu.
For experiments investigating both age and gender, we followed the techniques in Section 5.1 to create TikTok emulators, create bot accounts, and scroll through their TikTok feeds. For the gender experiment, similar to techniques discussed in [6], we run six devices in total: two devices with a self-declared gender of Man, two with a self-declared gender of Woman, and two additional devices without a gender specified. By comparing the feeds of the two devices that do not have a specified gender, we can establish a baseline of noise that we would expect between devices should gender not play a role in the delivery of content. If the declared gender does influence the recommendation algorithm, we would expect the videos delivered to devices with the same gender to be very similar to each other, and more similar to each other than those delivered to the two devices without a declared gender.
We perform a similar process for the age experiment; we create six devices, two declared as 13 years old (also referred to as "child" accounts), two as 18 years old, and two as 30 years old (collectively referred to as "adult" accounts). We ran this experiment twice. In the first run, one of the 30-year-old devices malfunctioned, so the main analysis of the video feeds uses the second run. The malfunction of one device did not affect the other devices, but it prevented us from comparing the similarity of the feeds of the two 30-year-old devices. The first run was delivered more ads, so we use it for analyzing ad delivery. We chose 13 as the youngest age because it is the youngest age at which TikTok doesn't deliver a heavily curated feed of family-friendly content, and the youngest age of users that can be targeted by advertisements. We chose 18 and 30 for the other devices because 18-year-olds are considered adults in the United States, and 30 is far enough from 18 for us to expect to see differences while still being in the age group of people who use TikTok frequently.
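The comparison design can be made concrete by enumerating every pair of devices and labeling it by the groups involved. The sketch below is illustrative only; the device names and the labels "control", "same", and "mixed" are our own, not terminology from the experiments.

```python
from collections import Counter
from itertools import combinations


def label_pairs(declared, control_group=None):
    """Label each unordered device pair as control, same-group, or mixed-group."""
    labeled = []
    for (dev_a, group_a), (dev_b, group_b) in combinations(declared.items(), 2):
        if group_a != group_b:
            kind = "mixed"
        elif group_a == control_group:
            kind = "control"  # both devices left the attribute undeclared
        else:
            kind = "same"
        labeled.append((dev_a, dev_b, kind))
    return labeled


gender_devices = {
    "man_1": "Man", "man_2": "Man",
    "woman_1": "Woman", "woman_2": "Woman",
    "none_1": "No Gender", "none_2": "No Gender",
}
pairs = label_pairs(gender_devices, control_group="No Gender")
print(Counter(kind for _, _, kind in pairs))
```

With six devices there are 15 pairs: one control pair, two same-gender pairs, and twelve mixed pairs. Personalization by gender would show up as the same-gender pairs being more similar than the control pair.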

RESULTS
Figure 1 presents our findings for the experiment measuring the degree of personalization based on user-declared gender (Man, Woman and, if not declared, No Gender). This graph shows how the Jaccard Index, a measure of similarity between two sets, changes over time. It is cumulative, so the first data point measures the Jaccard Index of the first 50 videos of each feed, while the second measures the Jaccard Index of the first 100 videos. We use a cumulative measure because we want to see how the index changes over time, but if the algorithm shows the same video to one user at one point and to the other user at a different point, we still want those two videos to count as a match.
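For two sets A and B, the Jaccard Index is |A ∩ B| / |A ∪ B|. A minimal sketch of the cumulative variant described above follows; the function names and the toy video IDs are hypothetical, and returning 0.0 for two empty feeds is our own convention.

```python
def jaccard(a, b):
    """Jaccard Index of two collections: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cumulative_jaccard(feed_a, feed_b, step=50):
    """Jaccard Index of the first k videos of each feed, for k = step, 2*step, ..."""
    limit = min(len(feed_a), len(feed_b))
    return [jaccard(feed_a[:k], feed_b[:k]) for k in range(step, limit + 1, step)]


# Toy feeds of video IDs: "v3" and "v1" appear in both feeds, at different positions.
feed_a = ["v1", "v2", "v3", "v4"]
feed_b = ["v3", "v5", "v1", "v6"]
print(cumulative_jaccard(feed_a, feed_b, step=2))
```

In the toy example the first data point is 0 because the first two videos of each feed share nothing, while the second data point is 2/6 because the shared videos count as matches even though each device saw them at different positions.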
Figure 1: Cumulative Jaccard Index of our gender experiment, measured every 50 videos. We don't see any clear patterns emerge here. We see the overall similarity increase at a fairly steady pace across all devices. We don't see a high degree of similarity between the two Man devices, whose similarity is lower than that of our two No Gender control devices.
If gender did play a role in the algorithmic personalization of the feeds, we would expect the devices with congruent gender to be most similar to each other, and the devices without gender declared to be very dissimilar to the devices with their gender declared. We see in Figure 1 that when comparing the similarities of feeds between a device with gender declared and a device without gender declared (the blue lines), the cumulative Jaccard Index over time varies greatly. Some pairs of devices start very similar and continue to be very similar, while others are not very similar. Interestingly, when we compare the Jaccard Index of the two Woman devices, we see that they are very similar, but so is a pairing of one Woman device and one Man device. The similarity between the two Man devices is very low, lower than the similarity between the two No Gender devices. Overall, we don't find any evidence of algorithmic personalization based on user-declared gender.
When looking at advertisement delivery, we likewise don't find any evidence of algorithmic personalization based on gender. We found this surprising, and it's possible that the quantity of ads we collected (around 100 ads per device for this experiment) is not sufficient to show trends. We leave further investigation of the algorithmic personalization of advertisements based on user-declared gender to future work.
To investigate the role that a user's age plays, we ran two experiments, each with six devices and with fresh accounts, although in the first experiment one of the age-30 devices malfunctioned. For the delivery of regular videos (i.e., not advertisements), we analyze the data from the second experiment, where all six devices were able to run. However, the devices in the second experiment were delivered fewer ads, so when analyzing advertising delivery we use the data from the first experiment with five devices.
The results of our age experiment can be found in Figure 2. In contrast to the gender results, here we find evidence of algorithmic personalization with regard to age. When looking at the similarity in feeds between devices that are the same age, we find their feeds to be very similar. We also find the feeds of two adult devices of different ages (one 18-year-old and one 30-year-old) to be very similar, albeit less similar than those of adult devices of the same age. However, when comparing the feeds of one device that is 13 years old and another that is an adult (either 18 or 30), we see their feeds to be vastly less similar. This provides evidence that there is algorithmic personalization based on age, where users who are the same or similar ages see more similar content, but the devices of children see very different feeds than those of adults. We argue that this is a promising finding, as it's likely that much of the content delivered to adults would be inappropriate for children, but more work is needed to understand what sorts of content are delivered to children and whether any inappropriate content is delivered to them.
Figure 2: Cumulative Jaccard Index of our second age experiment, measured every 50 videos. We see that devices with the same age are delivered feeds very similar to each other. We see that the combinations of the adult devices are also very similar. We don't see similar results for a child device and an adult device, with all combinations of a 13-year-old device and an 18- or 30-year-old device having a much lower Jaccard Index across the experiment.
When looking at advertisements delivered to our devices during the first experiment (Figure 3), we find very similar results. We see that the advertisements delivered to two devices of the same age (two 13-year-old or two 18-year-old devices) or to two adult devices (one 18-year-old and one 30-year-old device) had a much higher similarity than those delivered to one adult and one child device. This could be due to the design of the TikTok ad delivery platform, where advertisers select which groups receive their ads and may deliberately choose whether to send ads to children.

CONCLUSIONS AND FUTURE WORK
In this work we provide a comprehensive technique for performing audits of mobile platforms. We illustrate our technique for auditing TikTok using Android emulators and intercepting network traffic to learn more about each TikTok video. We then use this technique to investigate the role that user-provided demographic information has on the recommended videos and ads on a user's For You page. We find that gender doesn't seem to influence the delivery of videos or ads. We also find that age seems to matter a lot, especially with regard to whether a user is under the age of 18: our devices with a set age of 13 had feeds very similar to each other and very different from those of the accounts aged 18 or older. We also saw few advertisements shared between 13-year-old devices and the devices of adults, signifying that there may be a separate set of advertisements shown to children.
Figure 3: Cumulative Jaccard Index for the advertisements delivered during our first age experiment, measured every five videos. We see that the Jaccard Indices between one adult device and one child device are much lower than those between two 18-year-old devices, two 13-year-old devices, or two devices that are both adults.
Taken together, our methodology provides a framework for the continued auditing of mobile apps, including TikTok. We plan to use these techniques to perform more sophisticated audits involving more devices and more complicated bot behavior. One such experiment involves performing an audit similar to Boeker and Urman [4] but with more complex user interests and scraping the data from each bot for a longer period of time. In this way, we hope to see how the recommendation algorithm adapts and suggests content based on more unique user interests over a longer period of time on the mobile app, and whether the advertising engine begins recommending more personalized ad content to these users based on their interests.