Linguistic Alignments: Detecting Similarities in Language Use in Written Communication

Human language has many functions. Our communication on social media carries information about how we relate to ourselves and others, that is, our identity, and we adjust our language to become more similar to our community's, in the same way as we dress, style, and act to show our commitment to the groups we belong to. Within a community, members adopt the community's language, and the common language becomes a unifying factor. In this paper, we explore the possibilities of identifying linguistic alignment, that is, that individuals adjust their language to become more similar to their conversation partners in a community. We use machine learning to detect linguistic alignment with a number of different ideologies, communities, and subcultures, using two different approaches: transfer learning with RoBERTa and traditional machine learning using Random forest and feature selection.


I. INTRODUCTION
In both face-to-face and online interactions, individuals tend to subconsciously and subtly adjust their language to become more similar to their conversation partners [7]. This linguistic accommodation not only enhances the efficiency of communication; it also creates a positive social identity, fostering a sense of sympathy and belonging, and indicating identification with the group of individuals we are interacting with [20]. The motivation behind linguistic accommodation partly stems from a desire to express one's identification with a group, including a gain of social approval. The stronger the identification, the more individuals adapt and embrace the linguistic style of the specific group. Linguistic innovation, such as creating new words or using existing words in novel ways, is recognized as a means to form subcultures [5], but also as a tool to express the group's unique identity and to express group-specific matters more precisely. When an individual joins a community, their level of adaptation to the community's linguistic norms reflects their aspiration to fit in and the extent of their identity seeking. New members can adapt to existing community norms, while members who have been in the community for a long time can either adapt to new norms or stick to their previous styles and be innovators who transform the community [6]. Research suggests that the more often people talk (or write) to each other, the more similar their speech becomes, meaning that users who communicate in the same community become linguistically closer over time [3]. Research has also found that women accommodate the general structure of an online community to a greater extent than men [14].
In this work, we explore the possibilities of using machine learning to determine linguistic alignment with a set of online ideologies, communities, and subcultures. We have built classification models for determining linguistic alignment with seven different ideologies/subcultures/communities using data collected from online communication. The ideologies/subcultures/communities we consider are:
• Counter-jihad - a movement that considers Muslims living within Western boundaries a potential threat to Western society and culture.
• White supremacy - a belief that white people are superior to those of other races and thus should dominate them.
• Alt-right (alternative right) - an online phenomenon that can be described as a loosely connected far-right white nationalist movement.
• Animal rights - a movement promoting the idea that animals should be free to live without being used, exploited, or otherwise interfered with by humans.
• Environmentalism/environmental rights - a movement aiming to protect natural resources and ecosystems.
• Incel - an online subculture consisting of men on incel forums who blame women and society for their lack of romantic success.
• Jihadist - an ideology promoted by terrorist groups such as the so-called Islamic State or Al-Qaida.
Further, several attempts have been made to detect jihadist propaganda or promoters of jihadist ideologies [2], [10], [18]. Most attempts have used Twitter data and various forms of machine learning to classify Twitter accounts as pro-IS or normal. However, concerns have been raised regarding the quality of the data and the data collection methods, since the methods used are prone to sampling biases, and the datasets are not sufficiently filtered or validated [12]. Other work has examined methods for detecting incel communication [8].

II. LINGUISTIC ALIGNMENTS
The data (see Table I) we have used to train our classifiers is limited to certain groups, sources, or forums and does not cover an entire ideology or subculture.
• White Supremacy (WS): we used data from Stormfront and the VNN Forum. Stormfront presents itself as a community of "racial realists, idealists and white nationalists". The Vanguard National News Forum (VNN Forum) was launched in late 2001 as an uncensored forum for "white" people [9].
• Counter-jihad: we used data from Gates of Vienna, a website affiliated with the counter-jihad movement, featuring contributions from multiple writers. Gates of Vienna covers various aspects of the counter-jihad movement's historical evolution and offers information about European counter-jihad conferences.
• Alt-right: we used data from the Daily Stormer, one of the most notorious websites of the alt-right movement, established in 2013. The website gained attention when derogatory remarks about a woman who was killed in connection with the Unite the Right rally in Charlottesville in August 2017 were published.
• Incel: we used data from six different incel forums: Incels, Blackpill club, Non-cucks united, Lookmaxxing forum, Looks theory, and Yournotalone. All these forums are dedicated meeting places for incels. Incels have developed their own characteristic language, their own areas of interest, and speculative theories that strengthen their members' cohesion and sense of belonging [13].
• Animal rights: we used data from the Animal Liberation Front (ALF) and two subreddits. The ALF is an international group focused on animal rights, with a website where recommendations for reading and information about actions are shared and commented on. The subreddits that we used are r/AnimalRights and r/AnimalRebellion.
• Environmentalism: for environmentalism (environmental rights), we used data from three different subreddits: r/environment, r/extinctrebellion, and r/ClimateOffensive. We also used data from the Earth Liberation Front's (ELF) website. The ELF is an organisation that advocates direct action and revolutionary violence, relying on a leaderless resistance model of operations [17].
• Jihadist: we used the IS-produced magazines Dabiq and Rumiyah, aimed specifically at the West [4]. We also used the IS magazines Al-Hayat IS Report and Al-Hayat IS News, and Al-Qaeda in the Arabian Peninsula's English-language publication Inspire. Other sources are Al-Risalah, an English-language propaganda magazine published by Jabhat al Nusra, and extracts from IS supporter blogs.

III. METHOD
The data we used for training each classifier is listed in Table I. For all forums and subreddits, we collected data from users who have posted more than 20 and fewer than 10 000 posts in English. To determine the language of a post, the Python version of the library langdetect [15] was used. All posts from a user are merged into one text. For the magazines (Al-Risalah, Dabiq, Rumiyah, Inspire, Al-Hayat IS Report, and Al-Hayat IS News), we divided each magazine into articles or pages. The normal population data listed in Table I consists of 300 randomly selected users from three discussion forums and a set of blogs. Before training the classifiers, the data was cleaned, and each character was converted to lowercase. For each linguistic alignment, term frequency-inverse document frequency (TF-IDF) was calculated, and the 1000 terms with the highest TF-IDF score were selected. The 500 most frequent words and bi-collocation words with a frequency of more than 50 were also extracted. All words (extracted using TF-IDF, most frequent, and bi-collocation) were manually analyzed and selected as features. While building the TF-IDF vocabulary features, words that appear in more than 20% of the documents and words that appear in less than 0.1% of the documents were removed. This process eliminates the most common words and words that seldom appear in the corpus. Table III shows some examples of features for each linguistic alignment. We trained two models: a RoBERTa model and a Random forest model. Robustly Optimized BERT Pretraining Approach (RoBERTa) is a language model based on the transformer architecture [19]. We utilized a pre-trained RoBERTa model made available through the Hugging Face transformers library and fine-tuned it with our datasets. Since most of the posts used for the experiment are longer texts, the maximum sequence length was fixed to 512 tokens. The experiment was done with five epochs, and the batch size was 8. During the training process, we chose the best-performing model measured by accuracy on the validation set. For RoBERTa, the Adam optimizer was used with a small learning rate of 5e-6.
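The TF-IDF feature-selection step can be sketched in plain Python as follows. This is a simplified illustration, not the paper's exact pipeline: tokenization is reduced to whitespace splitting, and the function name and toy corpus in the usage note are invented for demonstration. The document-frequency cut-offs (terms in more than 20% or fewer than 0.1% of documents are dropped) and the top-1000 selection mirror the description above.

```python
import math
from collections import Counter

def tfidf_features(docs, top_k=1000, max_df=0.20, min_df=0.001):
    """Score terms by corpus-wide TF-IDF and keep the top_k terms whose
    document frequency lies between min_df and max_df (fractions of docs)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Drop terms outside the allowed document-frequency band.
    allowed = {t for t, c in df.items() if min_df <= c / n_docs <= max_df}
    # Sum each term's TF-IDF contributions over all documents.
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            if term in allowed:
                scores[term] += (count / len(tokens)) * math.log(n_docs / df[term])
    return [term for term, _ in scores.most_common(top_k)]
```

On a real corpus one would more likely use an off-the-shelf vectorizer (e.g. scikit-learn's TfidfVectorizer, which exposes the same `max_df`/`min_df` idea); the snippet only makes the selection logic explicit.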
To build the Random forest (RF) model, we used a bag-of-words model with manually selected features and the classification algorithm Random forest. When training the model, hyper-parameter tuning was done using grid search to estimate the optimal parameters of the classifier. The data was divided into 80% training and 20% test.
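A minimal sketch of the bag-of-words representation over a hand-picked feature list; the function name is illustrative, and the feature words in the usage example are invented placeholders rather than the paper's actual selected features:

```python
def bow_vector(text, features):
    """Represent a text as counts of a fixed list of selected feature words."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return [counts.get(feature, 0) for feature in features]
```

The resulting vectors could then be fed to, for example, scikit-learn's RandomForestClassifier, with GridSearchCV searching over hyper-parameters such as the number of trees and maximum depth; the exact grid used in the paper is not specified here.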
For each subculture/ideology, two different classes were created: a positive class and a negative class. The positive class contains data that represents the subculture/ideology, i.e., communication from a digital environment that is produced by members or promoters of the subculture/ideology. The data used for the negative class of each classifier is selected to represent subcultures/ideologies that an individual belonging to the positive class is very unlikely to be inspired by. For example, it is unlikely that an individual whose writing is aligned toward Alt-right is also aligned toward Animal rights or Environmentalism. However, an individual whose writing is aligned toward Alt-right might also be aligned toward White supremacy or Counter-jihad. The negative and the positive classes used to train each linguistic alignment classifier are presented in Table II. The normal population is described in Table I.

IV. RESULTS
After training the Random forest model on 80% of the data, the model was tested on the remaining 20%. For the Random forest model, White supremacy, Alt-right, and Environmentalism had the highest F1-scores.
When training the RoBERTa model, the data was divided into 70% training, 10% validation, and 20% test. The RoBERTa models had close to perfect F1-scores. The results are shown in Table IV. As seen, RoBERTa performs better than the Random forest.

V. TESTING IN THE WILD
To test our linguistic alignment models in a more realistic scenario, we have tested the models on a set of texts that are either transcripts of speeches or written texts. The texts relate to the various ideologies, communities, and subcultures we have trained our linguistic alignment classifiers to recognize. Table V shows the results of the two different classifications, where R is the RoBERTa model and RF is the Random forest model; classification results that match our expectations are shown in bold font. The RoBERTa White supremacy model classified the texts by John Earnest and Anders Breivik as white supremacy. The Random forest model classified the texts by John Earnest, Anders Breivik, Brenton Tarrant, Peyton Gendron, and Dylan Roof as white supremacy. The RoBERTa Counter-jihad model and the RoBERTa Alt-right model classified none of the texts as Counter-jihad or Alt-right, while the Random forest Counter-jihad model classified Breivik's text as Counter-jihad, and the Random forest Alt-right model classified the texts written by John Earnest, Brenton Tarrant, Peyton Gendron, and Dylan Roof as Alt-right (above 0.5 probability).
Both models classified the GP Animal Protection Manifesto and Labour's Animal Welfare Manifesto as animal rights, and classified Greta Thunberg's speech, the GP Animal Protection Manifesto, and Labour's Animal Welfare Manifesto as environmentalism. The RoBERTa animal rights model also (incorrectly) classified Greta Thunberg's speech and Elliot Rodger's YouTube transcript as animal rights.
The RoBERTa incel model correctly classified Jake Davison's posts as incel but did not succeed in classifying Elliot Rodger's YouTube transcript as incel. The Random forest incel model correctly classified Elliot Rodger's YouTube transcript as incel but did not classify Jake Davison's posts as incel. The Random forest incel model also incorrectly classified the texts written by John Earnest, Brenton Tarrant, Peyton Gendron, and Dylan Roof as incel (above 0.5 probability).
In summary, the results show that the performance of the models seems to differ when applied to new data.

VI. DISCUSSION
Transfer learning using RoBERTa provided significantly better classification results on the held-out test data than the Random forest model with feature selection. However, the Random forest model surprisingly worked better when the models were applied to new, unseen texts.
The performance of the models differed depending on the linguistic alignment. In the case study, the Random forest models worked much better on the right-wing alignments (Counter-jihad, Alt-right, and White supremacy) than the RoBERTa models. The RoBERTa models for Counter-jihad, Alt-right, and White supremacy did not perform well when it comes to classifying right-wing texts: the Random forest models could correctly classify five of the right-wing texts, while the RoBERTa models only managed to classify two texts (partially) correctly. However, it is important to note that the case study is small, and more data is needed to draw any conclusions from the results.
The RoBERTa incel model performed much better than the Random forest incel model. The Random forest incel model incorrectly classified four texts as incel, correctly identified one text as incel, and missed one text that should have been classified as incel. The RoBERTa incel model, on the other hand, missed one text and correctly classified one text as incel.
The small case study that we did only focused on longer texts. It would be interesting to use the models in a real scenario with shorter texts to get an understanding of the different models' performance on shorter texts. One of the challenges when using RoBERTa models is the limitation on the size of the texts that are classified: a RoBERTa model has a maximum sequence length of 512 tokens. The most common way to adhere to this limitation is to only use the first 512 tokens, which is a sufficient option in many cases. Another option is to split the text into multiple subtexts, classify each subtext, and combine the results (for example, by choosing the class that was predicted for most of the subtexts). This latter option is more resource-consuming, as all 512-token chunks in a long text must be classified.
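The split-and-vote strategy described above can be sketched as follows. Here `classify_chunk` stands in for any fixed-length classifier (such as a fine-tuned RoBERTa model), tokenization is simplified to whitespace splitting rather than subword tokenization, and the names are illustrative:

```python
from collections import Counter

def classify_long_text(text, classify_chunk, max_tokens=512):
    """Split a long text into chunks of at most max_tokens whitespace
    tokens, classify each chunk, and return the majority-vote label."""
    tokens = text.split()
    chunks = [" ".join(tokens[i:i + max_tokens])
              for i in range(0, len(tokens), max_tokens)]
    votes = Counter(classify_chunk(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]
```

In a real setting the chunking would be done on subword tokens (respecting the model's own tokenizer), and one might weight votes by chunk length or by the model's predicted probabilities instead of a plain majority.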

VII. CONCLUSION AND DIRECTIONS FOR FUTURE WORK
We have examined the possibility of building classification models that can be used to determine linguistic alignment with a set of online ideologies, communities, and subcultures. Transfer learning with RoBERTa gave the best results on held-out test data, while the Random forest models generalized better to the unseen texts in our case study. Directions for future work include evaluating the models on shorter texts and on larger, independently collected datasets.

TABLE II: THE DATA USED FOR THE POSITIVE AND THE NEGATIVE CLASS OF EACH LINGUISTIC ALIGNMENT MODEL.

TABLE III: EXAMPLES OF FEATURES FOR EACH LINGUISTIC ALIGNMENT.

TABLE IV: RESULTS FOR THE CLASSIFICATION OF LINGUISTIC ALIGNMENT USING RANDOM FOREST AND ROBERTA.