Same or Different? Diff-Vectors for Authorship Analysis

We investigate the effects on authorship identification tasks of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (among which one that we here make available for the first time) show that feature vectors representing pairs of documents (that we here call Diff-Vectors) bring about systematic improvements in the effectiveness of authorship identification tasks, and especially so when training data are scarce (as is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd that use a solver for the 1st as a building block.


INTRODUCTION
Recent years have seen an increased interest in automated authorship analysis, a set of tasks aiming to infer the characteristics of the author of a text of unknown or disputed paternity. Authorship analysis is concerned with inferring characteristics such as the gender [Koppel et al. 2002], the age group [Gollub et al. 2013], or the native language [Tetreault et al. 2012] of the author, among others; these subtasks usually go under the name of author profiling [Argamon et al. 2009]. Alternatively, authorship analysis may be concerned with inferring the identity of the author; tasks in which this is the goal are collectively referred to as authorship identification tasks, and include authorship verification (AV, the task of predicting whether a given author is or is not the author of a given anonymous text [Stamatatos 2016]), authorship attribution (AA, the task of predicting who among a given set of candidates is the most likely author of a given anonymous text [Juola 2006; Koppel et al. 2009; Stamatatos 2009]), and same-author verification (SAV, the task of predicting whether two given documents are by the same, possibly unknown, author or not [Koppel and Winter 2014]). Authorship analysis has several applications, e.g., in supporting the work of philologists who try to identify the authors of texts of literary or historical value [Benedetto et al. 2013; Corbara et al. 2019; Kabala 2020; Kestemont et al. 2015; Mosteller and Wallace 1964; Savoy 2019; Tuccinardi 2017], or in aiding linguistic forensics experts in crime prevention or criminal investigation [Chaski 2005; Larner 2014; Rocha et al. 2017].
All of these tasks are usually approached as text classification tasks, whereby a supervised machine learning algorithm uses a set of labelled documents to train a classifier to perform the required prediction task. As in many supervised learning endeavours, each training example is usually represented as a vector of features, where the value of a feature in a vector usually corresponds to the relative frequency with which a certain linguistic phenomenon (say, an exclamation mark, or a POS-gram) occurs within the document.
In this paper we carry out an in-depth analysis of an alternative method for generating vectorial representations of texts for authorship identification. Specifically, while in the standard representation methodology a vector represents a document, in this alternative method a vector represents an unordered pair of different documents. While in the standard methodology the value of a feature is (an increasing function of) the relative frequency of occurrence of a given linguistic phenomenon in the document, in this alternative method it is the absolute value of the difference between the relative frequencies (or increasing functions thereof) of this phenomenon in the two documents. Since these vectors represent differences, we call these representations Diff-Vectors (DVs). While in the standard methodology the class label is the author of the document, in this DV-based methodology the class label is one of the two classes Same or Different (standing for "same author" and "different authors", respectively).
Technically, this latter type of representation is not novel, since it was first described (to the best of our knowledge) by Koppel and Winter [2014]. However, curiously enough, the goal of [Koppel and Winter 2014] was to propose a different method (the "impostors" method for SAV), and its authors mention the DV-based representation only to dismiss it as a "simplistic baseline method" [Koppel and Winter 2014, p. 179]. Since then, the use of DVs has never been studied systematically; to carry out such a systematic study is the goal of the present paper.
We carry out extensive experiments on a number of publicly available datasets (among which one that we here make available for the first time) representative of different textual genres, lengths, and styles. In these experiments we tackle different authorship identification tasks, including SAV (for which DVs are naturally geared), AA, and AV; for these two latter tasks we propose two new methods, Lazy AA and Stacked AA (two AA methods that can also be used for AV), that solve AA by using a DV-based SAV classifier as a building block. Our experiments show that the DV-based representation is advantageous, since it brings about substantially increased effectiveness at the price of a tolerable increase in computational cost. The experiments also show that DVs bring about substantial improvements especially in low-resource authorship analysis tasks, i.e., in tasks characterised by small quantities of training data (which is the case in many real-life authorship analysis scenarios, such as those dealing with ancient texts). Like the standard representation, the DV-based representation is learner-independent, i.e., it can be used in connection with any (supervised or unsupervised) learning method.
The rest of the paper is structured as follows. In Section 2 we formally describe DVs and justify why they look like a superior means of representing authorship-related information. In Section 3 we describe algorithms for casting authorship identification tasks (such as AV or AA) in terms of SAV (the task that DVs are naturally designed for). Section 4 reports the results of our experiments; in particular, Section 4.4 discusses our "intrinsic" evaluation of DVs, i.e., one in terms of same-author verification, while Section 4.5 discusses an "extrinsic" evaluation of DVs, i.e., one in terms of downstream tasks such as AV and AA. Section 5 discusses related work, while Section 6 wraps up, also pointing at avenues for further research.

Authorship identification tasks
We assume a finite set A of authors (where A will often be called the codeframe) and a domain D of documents. For each document d ∈ D we indicate by y_d ∈ A the true author of d. We also assume the existence of a training set L = {(d_i, y_i)}_{i=1}^{n} of documents of known paternity. We define authorship verification (AV) as the task of predicting, given a document d and a candidate author A* ∈ A = {A_1, ..., A_m}, whether A* is the author of d or not, where the labels y_1, ..., y_n of the training documents are in A = {A_1, ..., A_m}, with m ≥ 2.² We define (closed-set) authorship attribution (AA) as the task of predicting, given a document d and m candidate authors A = {A_1, ..., A_m} (one of whom is assumed to be the author of d), who among the members of A is the author of d, where the labels of the training documents are in A = {A_1, ..., A_m}, with m ≥ 2.³ We define same-author verification (SAV) as the task of predicting, given two unlabelled documents d_i and d_j, whether they are by the same author or not, where the labels of the training documents are in A = {A_1, ..., A_m}, with m ≥ 2. This task admits two different variants, i.e., (i) closed-set SAV, which corresponds to the setup in which the authors of the unlabelled documents are assumed to be in A, and (ii) open-set SAV, where the authors of the unlabelled documents are not necessarily in A.
Note that terminology is somewhat variable across the authorship analysis literature, and some of the above tasks may be defined slightly differently in other works. For instance, authorship verification is sometimes defined (see e.g., [Kestemont et al. 2021]) as the task of predicting whether, given a document d and one or more documents known to be by a candidate author A*, also d is by A*. In this latter definition authorship verification shares some characteristics with "our" AV (in the fact that a candidate author A* for document d is considered) and with "our" SAV (in the fact that we check whether d is by A* by testing if d is by the same author as other texts known to be by A*). Our definitions of AV and SAV are, we think, cleaner, since they clearly separate (i) the task of predicting whether a document d is by a candidate author A*, from (ii) the task of predicting whether a document d is by the same author as some other document. Our definitions are also more general, since "our" SAV does not assume the author of one of the two documents to be known.
² Alternatively, AV can be formulated as a problem in which m = 2 and A = {A*, Ā*}, in which class Ā* collectively represents the production of authors other than A*. This special case will be discussed in more detail in Section 4.7.
³ In real cases we may not be certain that the author of d is indeed in A = {A_1, ..., A_m}; in these cases, closed-set AA amounts to indicating who, among the authors in A = {A_1, ..., A_m}, is the most likely author of d.

Diff-Vectors
In "standard" authorship identification, each document   is represented via a labelled vector x  of features, where each feature usually represents a linguistic phenomenon that may occur (possibly several times) in a document of D, the label   ∈ A represents the true author of   , and the value x   of the -th feature in vector x  represents a non-decreasing function (e.g., tfidf) of the relative frequency of the linguistic phenomenon in   .For instance, if the -th feature stands for character 3-gram "car", then the value of x   may be the number of occurrences of character 3-gram "car" in   divided by the number of all character 3-grams that   contains.
We here study an alternative type of vectorial representation for authorship identification tasks. Here, a labelled vector x_ij represents an unordered pair (d_i, d_j) of documents in D such that i ≠ j, each feature represents a linguistic phenomenon that may occur (possibly several times) in a document of D, the label y_ij ∈ P = {Same, Different} indicates whether the true authors of d_i and d_j are the same person or not, and the value of the k-th feature in vector x_ij represents the absolute difference between non-decreasing functions of the relative frequencies of the linguistic phenomenon in d_i and d_j. (In this section we provisionally assume this function to be the identity function f(x) = x, while in the sections to come this function will be some well-established feature weighting function.) Since the difference between relative frequencies is central to the definition of these vectors, we call them Diff-Vectors (DVs).
If we have chosen our features well, i.e., if the frequencies of occurrence of the corresponding linguistic phenomena are indeed indicative of authorship, when two documents have been written by the same author the values of these features in x_ij will be low, since the above frequencies will be similar in the two documents. In other words, DVs belonging to class Same will tend to be characterised by low feature values and low norms, while vectors belonging to class Different will tend to be characterised by high feature values and high norms. The quintessential example of a DV likely to be in class Same is the vector of all 0's, since the fact that for all features the frequency of occurrence of the feature in the two documents is identical is highly indicative of the fact that (if the features have been chosen well) the two authors are the same person. Conversely, the quintessential example of a DV likely to be in class Different is (if feature values are all normalised) a vector of all 1's, since it represents two documents with maximally different frequencies of occurrence for all features. All DVs fall, if normalised, in the unit hypercube.
More in general, if a DV belongs to class Same, DVs that lie between it and the vector of all 0's will also tend (if we have chosen our features well) to belong to Same. As a result, the region that contains the DVs belonging to Same will tend to be the portion falling in the non-negative orthant of a star-convex region centred at the origin of the axes. In particular, if Same and Different are linearly separable, and if f is the dimensionality of the feature space, the region that contains all the DVs belonging to Same will tend to be (see Figure 1) an f-simplex (in f = 3 dimensions: a tetrahedron) with an orthogonal corner, and the separating surface will tend to be an (f − 1)-simplex (in f = 3 dimensions: a triangle).

Any set of labelled documents L = {(d_1, y_1), ..., (d_n, y_n)} can be represented either in the standard way or via DVs. One of the main differences between the two representations is that the "standard" representation gives rise to n labelled vectors (one per document), while the alternative representation gives rise to n(n − 1)/2 labelled vectors (one per unordered pair of documents). The other main difference is that a classifier using the "standard" representation attempts to predict, given an unlabelled document, its true author, while a classifier using the DV-based representation attempts to predict, given two unlabelled documents, whether the two documents are or are not by the same author. In other words, the standard representation is geared towards AV or AA, while the DV-based representation is geared towards SAV. However, AV and AA can (as discussed below) be recast in terms of SAV, and vice versa; as a result, we will consider the two representations as general-purpose alternatives, and we will study them as such.
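As a simple illustration (ours, not taken from the paper's codebase), a DV is just the component-wise absolute difference of two document feature vectors; pairs of documents by the same author tend to yield DVs with small norms:

```python
import numpy as np

def diff_vector(x_i: np.ndarray, x_j: np.ndarray) -> np.ndarray:
    """Diff-Vector of an unordered pair of documents: the component-wise
    absolute difference of their (frequency-based) feature vectors."""
    return np.abs(x_i - x_j)

# toy vectors of relative frequencies for three features
x_a1 = np.array([0.12, 0.05, 0.30])   # a document by author A
x_a2 = np.array([0.11, 0.06, 0.28])   # another document by author A
x_b1 = np.array([0.02, 0.20, 0.10])   # a document by author B

print(np.linalg.norm(diff_vector(x_a1, x_a2)))   # small norm: likely "Same"
print(np.linalg.norm(diff_vector(x_a1, x_b1)))   # larger norm: likely "Different"
```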

Diff-Vectors result in more training examples for AV
Our working hypothesis is that the DV-based representation is advantageous. In order to show this, let us consider AV, and let us assume that A* ∈ A is our candidate author. When using the standard representation, we typically replace each label in A \ {A*} with label Ā* (to indicate the complement of A*) and train a binary classifier that discriminates between A* and Ā*. However, in doing so a lot of information is lost, namely, the information whether two training examples in Ā* are by the same author or not. For authorship-related tasks this is valuable information, which the standard representation wastes and the DV-based representation does not. The following example shows that the information wasted by the standard representation is, indeed, a lot.

Example 2.1. Assume a training set consisting of 10 authors with 100 training documents each (1,000 documents in total), and assume A* is our candidate author. These documents give rise to 1,000 · 999/2 = 499,500 unordered pairs, which fall into four types: pairs of documents both written by A* (Type 1a; there are 4,950 of them), pairs of documents both written by the same author other than A* (Type 1b; 44,550), pairs in which one document is written by A* and the other is not (Type 2a; 90,000), and pairs of documents written by two different authors, neither of whom is A* (Type 2b; 360,000).

Note that the information provided to the training process by the examples of Type 1a is also provided (albeit in a different form) when using the standard representation, since with the latter the learner is implicitly told that the two documents are from the same author. The same happens for the examples of Type 2a, since with the standard representation the learner is implicitly told that the two documents are from different authors. However, the key observation here is that the examples of Type 1b and Type 2b provide information that is instead lost when using the standard representation, since the standard representation only tells the learner that the two documents are not by A*, but does not tell the learner if they are by the same author or not. In sum, 404,550 out of 499,500 training examples, i.e., about 81% of the entire set, provide information that was not provided by the standard representation; in other words, in this case the learner receives more than 5 times the amount of information that the standard representation provides to it. □

More in general, if we have m authors and a total of n training documents, i.e., k = n/m training examples per author, the number of DVs that do not provide additional information with respect to the standard representation is

  k(k − 1)/2 + (m − 1)k²        (1)

i.e., the number of pairs of Type 1a plus the number of pairs of Type 2a, while the number of DVs that do provide additional information is

  (m − 1)k(k − 1)/2 + (m − 1)(m − 2)k²/2        (2)

i.e., the number of pairs of Type 1b plus the number of pairs of Type 2b. Note that, while the amount of information that was already available to the learning process is O(mk²) (Equation 1), the new information made available to it is O(m²k²) (Equation 2). The latter amount of information can be extremely valuable, especially since it comes at no cost, and especially in application scenarios (as there are many in authorship identification) characterised by the scarcity of training data. Among all of the above, mk(k − 1)/2 are examples of Same, which are O(mk²), while m(m − 1)k²/2 are examples of Different, which are O(m²k²).
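The counts in Example 2.1 are easy to verify programmatically; the following short script (ours, purely illustrative) enumerates the four pair types for m authors with k training documents each:

```python
from math import comb

def pair_counts(m: int, k: int):
    """Number of training pairs of each type, for m authors with k documents
    each, when A* is the candidate author of an AV problem."""
    type_1a = comb(k, 2)                # Same pairs within A*
    type_1b = (m - 1) * comb(k, 2)      # Same pairs within one author != A*
    type_2a = k * (m - 1) * k           # Different pairs: one document by A*, one not
    type_2b = comb(m - 1, 2) * k * k    # Different pairs: two distinct authors != A*
    return type_1a, type_1b, type_2a, type_2b

t1a, t1b, t2a, t2b = pair_counts(m=10, k=100)
print(t1a + t1b + t2a + t2b)   # 499500 pairs in total
print(t1b + t2b)               # 404550 pairs carrying information that the
                               # standard representation does not provide
```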
In sum, when our task is AV, if we switch from standard representations to DV-based representations, we end up with a much higher quantity of training data, since DV-based representations exploit information that standard representations waste. However, note that switching from standard representations to DV-based representations means switching (as noted at the end of Section 2.2) from vectors geared towards AV to vectors geared towards SAV. This suggests the idea of using these vectors to train a high-performance SAV classifier, and then devising an algorithm that can perform AV on top of this SAV classifier; this is the goal we will pursue in Section 3.3.

Diff-Vectors make training more robust in closed-set AA
The fact that more information is provided to the training process holds for AV, but does not necessarily hold for other authorship identification tasks. In general, this fact only holds for tasks in which, as in AV, the training documents by different authors end up being grouped together into a single class; this happened in AV with the Ā* class. However, that more information is provided to the training process does not hold when the above-mentioned grouping does not happen, as, e.g., in closed-set AA. In the latter task, the information conveyed to the training process by a DV with label Same is obviously also implicitly conveyed when using the standard representation (where the vectors corresponding to the two documents are labelled with the same author), and the same holds for DVs with label Different (where the two vectors are labelled with different authors).
So, in closed-set AA (and, in general, in tasks of this kind) it would appear that there is no advantage in using DVs. This is actually not true, because the advantage lies in the fact that, when using DVs, all the training information is concentrated on labelling just two classes, i.e., Same and Different, while in the classical representation this information is spread thinly across the m author classes. In sum, the use of the DV-based representation in closed-set AA allows the SAV binary classifier to be trained robustly, thanks to the fact that the existing amount of training information can be devoted to solving a comparatively easier binary classification task rather than a comparatively more difficult 1-of-m classification task. We can thus expect to obtain accurate SAV classification predictions; in Section 3.2 we will see that these SAV predictions can also be used by a downstream process to solve authorship identification tasks such as AV and AA.

SOLVING SAV, AA, AND AV, BY MEANS OF DIFF-VECTORS
One difference between the standard representation, in which class labels represent authors, and the representation based on DVs, in which class labels are in {Same, Different}, is that the tasks that can be solved "directly" are AV and AA for the former, and SAV for the latter. That is, by using the standard representation, AV and AA can be solved directly by setting up a classifier that, for a given document, returns a class label in A (for AA) or in {A*, Ā*} (for AV); SAV is instead to be solved as a derivative, "downstream" task, e.g., by first determining the true authors of documents d_i and d_j by means of two calls to an AA engine, and then checking whether the two returned class labels are the same or not. On the contrary, when using the DV-based representation, SAV is solved directly; AV and AA are instead to be solved as derivative tasks, using SAV as the building block of any algorithm for solving them. In this section we first formally define our method for performing SAV (Section 3.1), and then go on to describe two alternative solutions for solving both AV and AA (Section 3.2) that build on top of the former.

Solving SAV by means of Diff-Vectors
Given a training set L = {(d_1, y_1), ..., (d_n, y_n)} of documents d_i ∈ D labelled by classes y_i ∈ A = {A_1, ..., A_m} representing authors, we define its pair-based version as

  L_P = {((d_i, d_j), SD(y_i, y_j)) : 1 ≤ i < j ≤ n}

where SD(y_i, y_j) is an indicator function that returns Same if y_i = y_j and Different otherwise. We also assume a feature extractor φ : D → R^f which maps documents d ∈ D into f-dimensional vectors x of real numbers. We can thus rewrite L as {(x_1, y_1), ..., (x_n, y_n)} and redefine L_P as

  L_P = {(x_ij, SD(y_i, y_j)) : 1 ≤ i < j ≤ n}

where x_ij ∈ R^f is a vector of absolute differences of feature values, i.e., x_ij is the vector whose k-th component is |x_ik − x_jk|. Since L contains n documents while L_P contains n(n − 1)/2 pairs, the pair-based version L_P is (n − 1)/2 times larger than its standard counterpart L. In practice, the size of L_P can be so large as to make the learning process intractable for some batch learners. For example, the 499,500 training DVs of Example 2.1 would result from a dataset of 10 authors and 100 training documents per author, which is not a terribly large dataset. As shown in Section 2.4, L_P tends to be imbalanced, with a Same / Different training example ratio close (assuming a training set containing the same number of documents for each author) to 1/m.
In practice, we will be interested in generating and using only a subset L′_P ⊂ L_P: by including in L′_P a small enough number of elements of L_P we can make the training process tractable, and by including in L′_P an equal number of examples of Same and Different we can avoid the typical negative consequences of imbalance. By using a subset L′_P with these characteristics, we can then train a binary classifier h : R^f → {Same, Different}. We call this classifier DV-Bin, since it is a binary classifier that uses DVs.
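A minimal sketch of this training step (ours; X is a dense matrix of standard document vectors, y the corresponding author labels, and the cap of 50,000 Same pairs anticipates the sampling policy described in Section 4.2):

```python
import random
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def make_dv_training_set(X, y, max_same=50_000, seed=0):
    """Build a balanced subset L'_P of Diff-Vectors labelled Same (1) / Different (0)."""
    same, diff = [], []
    for i, j in combinations(range(len(y)), 2):
        (same if y[i] == y[j] else diff).append((i, j))
    random.seed(seed)
    random.shuffle(same)
    random.shuffle(diff)
    same = same[:max_same]
    diff = diff[:len(same)]              # as many Different pairs as Same pairs
    dvs = np.array([np.abs(X[i] - X[j]) for i, j in same + diff])
    labels = np.array([1] * len(same) + [0] * len(diff))
    return dvs, labels

# dvs, labels = make_dv_training_set(X_train, y_train)
# dv_bin = LogisticRegression(max_iter=1000).fit(dvs, labels)   # the SAV classifier h
```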
Without loss of generality, and for ease of notation, we will henceforth use h as a function of two arguments, h : D × D → {Same, Different}, thus leaving implicit the phases of (a) mapping documents to feature vectors, and (b) computing DVs from the absolute differences of feature values. As a result, we can simply write h(d_i, d_j) to indicate a predicted label in {Same, Different}.

Solving AA by means of Diff-Vectors
In this section we describe how SAV can be used to implement AA as a downstream task.
In order to predict by whom among the authors in A a test document d has been written, and to do so by using a SAV classifier, it makes sense to look at how d relates to the training documents in terms of the Same and Different classes. For instance, if for all documents d′ ∈ L written by A_i the pair (d, d′) is assigned by the SAV classifier to class Same, and if for all d″ ∈ L written by an author in A \ {A_i} the pair (d, d″) is assigned to class Different, it would be reasonable to predict that d has been written by A_i.
Unfortunately, this uniformity rarely occurs in practice: in more typical cases the SAV classifier will assign to class Same, say, some pairs (d, d′) where d′ has been written by A_i, and some pairs (d, d″) where d″ has been written by an author other than A_i. This brings up the question: how should we act in the presence of such apparently contradictory outcomes?
Given that we need to build our AA algorithm on top of the output of the SAV classifier, it is in our best interest to squeeze every possible bit of information from this output. As a result, we will be interested in exploiting not just the binary prediction of the SAV classifier, but also its non-binary classification score, representing the degree of certainty with which it has issued this prediction. We assume that our SAV classifier is of the form

  h : D × D → [0, 1]

i.e., that it returns classification scores that are posterior probabilities. These latter are values Pr(Same|d_i, d_j) that denote the probability that the SAV classifier attributes to the fact that d_i and d_j have been written by the same author, and are such that Pr(Different|d_i, d_j) = 1 − Pr(Same|d_i, d_j).
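Concretely, such a soft SAV classifier can be obtained by wrapping a trained DV-Bin classifier (a sketch, with illustrative names; it assumes the Same class was encoded as 1, as in the earlier sketch, and that featurize implements the feature extractor φ of Section 3.1):

```python
import numpy as np

def make_p_same(dv_bin, featurize):
    """Wrap a trained (probabilistic) DV-Bin classifier into a soft SAV
    classifier, i.e., a function returning Pr(Same | d1, d2)."""
    same_index = list(dv_bin.classes_).index(1)      # assumes Same was encoded as 1
    def p_same(d1, d2):
        dv = np.abs(featurize(d1) - featurize(d2)).reshape(1, -1)
        return dv_bin.predict_proba(dv)[0, same_index]
    return p_same
```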
We explore two techniques for building AA classifiers on top of SAV classifiers, one inspired by lazy learning methods [Aggarwal 2014] and another inspired by the well-known Stacked Generalisation algorithm [Wolpert 1992].
3.2.1 Lazy AA. The first SAV-based AA algorithm that we explore in this paper, and that we call Lazy AA, draws inspiration from distance-weighted k-NN, but is different from it. Similarly to distance-weighted k-NN, the underlying idea of our method is that, given a test document d, if a training document d′ authored by A_i is "stylistically similar" to d, this brings evidence towards the fact that also d is authored by A_i, and this evidence can be quantified exactly by the amount of stylistic similarity. Differently from distance-weighted k-NN, though, instead of having access to a function that computes the similarity between two documents, we here have access to a SAV (soft) classifier that computes the probability that the two documents are in class Same. It is thus just natural to compute the stylistic similarity between d and d′ as Pr(Same|d, d′), i.e., as the probability that the SAV classifier attributes to the fact that d and d′ have been written by the same author.
Our combination rule thus consists of selecting, for each author A_i ∈ A, the k training documents written by A_i that are stylistically most similar to our test document d (i.e., the ones for which Pr(Same|d, d′) is highest), and computing the average value of this stylistic similarity across these k documents; the author for which this average stylistic similarity is highest is predicted to be the author of d. In symbols, this comes down to

  h′(d, L, k) = argmax_{A_i ∈ A} (1/k) Σ_{d′ ∈ NN(d, L, A_i, k, h)} Pr(Same|d, d′)

where NN(d, L, A_i, k, h) returns the k documents from training set L that have been written by author A_i and are closest to d according to the SAV classifier h. Note that the h′ functional is parameterised by L (and k) since, as in all lazy learning methods, there is no proper training phase for h′, and all the computation is carried out at classification time.
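A minimal sketch of this combination rule (ours; training_docs is a list of (document, author) pairs and p_same is a soft SAV classifier such as the one sketched above):

```python
from collections import defaultdict

def lazy_aa(test_doc, training_docs, p_same, k=5):
    """Attribute test_doc to the author whose k training documents that are
    'closest' according to the SAV posterior have the highest average Pr(Same)."""
    posteriors_by_author = defaultdict(list)
    for doc, author in training_docs:
        posteriors_by_author[author].append(p_same(test_doc, doc))
    avg_topk = {
        author: sum(sorted(ps, reverse=True)[:k]) / min(k, len(ps))
        for author, ps in posteriors_by_author.items()
    }
    return max(avg_topk, key=avg_topk.get)   # predicted author
```

The value of k would then be selected via the leave-one-out procedure described next.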
The optimal value for parameter k can be found via "leave-one-out" (LOO) validation on the training set L. That is, for each value of k in the tested range, each training document d_i ∈ L is classified by a classifier h′ trained on L \ {d_i}; k is thus set to the value that maximises a given effectiveness measure as computed on the entire set L.⁷ If we use (vanilla) accuracy (i.e., the proportion of correctly classified instances) as the effectiveness measure, this process comes down to computing

  argmax_k (1/|L|) Σ_{(d_i, y_i) ∈ L} 1[h′(d_i, L \ {d_i}, k) = y_i]

where 1[·] denotes the indicator function.

⁷ One might wonder why we go for LOO, a traditionally expensive (and sometimes too expensive) way of optimising parameters, rather than the cheaper k-fold cross-validation (k-FCV).

3.2.2 Stacked AA. The second SAV-based AA algorithm that we explore, and that we call Stacked AA, draws inspiration from the Stacked Generalisation algorithm: the trained SAV classifier h is used to map each (vectorised) document into the vector of posterior probabilities that h computes for the pairs formed by that document and each of the n training documents. In other words, by applying the mapping Φ : R^f → [0, 1]^n to the training documents themselves we define a new "view" L_h = {(Φ(d_i), y_i)}_{i=1}^{n} of the training set L, in which the training documents are not represented via vectors of f stylometric features but, thanks to the underlying SAV classifier, via vectors of |L| posterior probabilities, with Φ(d) ∈ [0, 1]^n. The training set L_h can directly be used to train a general-purpose classifier h′ : [0, 1]^n → A in the feature space of posterior probabilities. Of course, in order to generate L_h = {(Φ(d_i), y_i)}_{i=1}^{n} we first need to train a SAV classifier h via the DV-Bin method of Section 3.1. In the experiments of Section 4 we will concentrate on instantiations of h′ that are generated by the same learning method (e.g., logistic regression) used to generate h.
At classification time, a given test document d is classified by first computing Φ(d) (this requires invoking classifier h n times) and then invoking classifier h′ on Φ(d).
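A sketch of the corresponding stacking step (ours; names are illustrative, and the metaclassifier is instantiated with logistic regression, as in the experiments of Section 4):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi(doc, training_docs, p_same):
    """Represent a document as the vector of SAV posteriors against all training documents."""
    return np.array([p_same(doc, d) for d, _ in training_docs])

def train_stacked_aa(training_docs, p_same):
    """Train the metaclassifier h' on the posterior-probability 'view' of the training set."""
    X_meta = np.array([phi(d, training_docs, p_same) for d, _ in training_docs])
    y_meta = [author for _, author in training_docs]
    return LogisticRegression(max_iter=1000).fit(X_meta, y_meta)

# at classification time:
# meta = train_stacked_aa(training_docs, p_same)
# author = meta.predict(phi(test_doc, training_docs, p_same).reshape(1, -1))[0]
```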
There are several important aspects in which Stacked AA differs from Lazy AA; one of them is that the base classifiers and the metaclassifier use different sets of classes. Indeed, the base classifiers use the classes in {Same, Different}, since they are binary SAV classifiers, while the metaclassifier uses the classes in A = {A_1, ..., A_m}, since it is a single-label multiclass AA classifier.

Solving AV by means of Diff-Vectors
It is fairly straightforward to take the algorithms described in Sections 3.2.1 and 3.2.2 and generate versions (which we will dub Lazy AV and Stacked AV) that solve AV instead of AA. The only difference between Lazy AV and Lazy AA, and between Stacked AV and Stacked AA, is that in the AV versions of the two algorithms the codeframe used is binary, i.e., it is A = {A*, Ā*}; in particular, this means that for Stacked AV the metaclassifier h′ is a binary classifier instead of a multiclass classifier. Everything else is unmodified.
However, in preliminary experiments that we have run, both Lazy AV and Stacked AV proved substantially inferior to versions of Lazy AA and Stacked AA, respectively, in which we attribute document d to A* if the AA algorithm does so, and we attribute d to Ā* if the AA algorithm attributes it to an author A_i different from A*. Concerning the reason why Lazy AV underperforms Lazy AA, this likely has to do with the fact that there is an a priori high probability that the k nearest neighbours in Ā* are, on average, closer to d than the k nearest neighbours in A*, since Ā* is a very large pool to choose from (this does not happen in AA, where, assuming an equal number of training documents per author, all pools are equally large); this can give undue advantage to Ā* over A*, and thus generate a large quantity of false negatives. Concerning the reason why Stacked AV underperforms Stacked AA, this likely has to do with the fact that the metaclassifier of Stacked AV does not put the available class information to the best use, i.e., it conflates all labels different from A* into a single label Ā* that ends up being poorly characterised from the semantic point of view.
Therefore, in the rest of the paper the algorithms we will use for solving AV via DV-based representations will be the versions of Lazy AA and Stacked AA described at the beginning of the previous paragraph. A consequence of this is that any AA experiment that involves the use of either Lazy AA or Stacked AA and a codeframe A = {A_1, ..., A_m} is also de facto a set of m different AV experiments. In other words, we will not need to run separate AA and AV experiments, i.e., we will evaluate the AA experiments that we describe in Section 4 both in terms of AA and in terms of AV.
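In code, the resulting AV procedure is a thin wrapper around the AA function (a sketch, reusing the lazy_aa function sketched earlier):

```python
def verify_author(test_doc, candidate_author, training_docs, p_same, k=5):
    """AV on top of AA: accept the candidate author only if the AA method picks them."""
    return lazy_aa(test_doc, training_docs, p_same, k=k) == candidate_author
```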

EXPERIMENTS
In order to test whether a representation based on DVs is advantageous with respect to a representation based on standard vectors, we compare these two different design choices in experiments that we run on four publicly available datasets (among which one that we here make available for the first time) and for all three authorship analysis tasks (AA, AV, SAV).The code to reproduce our experiments is available online at https://github.com/AlexMoreo/diff-vectors .

Datasets
We run experiments on four datasets consisting of textual documents annotated by author; our datasets are representative of different textual genres, lengths, and styles, are publicly available, and all consist of English texts. The four datasets are:

• IMDB62. This dataset was created and made publicly available (along with an extended version, IMDB1million) by Seroussi et al. [2014]. It contains film reviews collected from the popular Internet Movie Database, and accounts for 62 authors/reviewers and 1,000 reviews authored by each of them. In order to divide the 62,000 documents into a training set and a test set we perform a stratified split, resulting in 700 training documents and 300 test documents for each author. We use these texts as examples of a "moderately formal" type of communication, since the reviews are not as short as, for example, online messages, and, despite some occasional slang, are written in a clear and correct (although often informal) manner.

• PAN2011. This dataset was created for the PAN 2011 international authorship identification competition [Argamon and Juola 2011]. The dataset is based on the Enron email corpus [Klimt and Yang 2004], i.e., the documents are emails annotated by author. Klimt and Yang [2004] have removed personal names and email addresses and replaced them with specific tags, which means that an authorship identification method is not able to use this extremely revealing information. In our experiments we use the "Large" training set (containing 9,337 documents, altogether accounting for 72 different authors) and the corresponding test set (containing 1,300 documents altogether, by the same authors represented in the training set). The emails are often extremely short, and show many characteristics of online communication; in order to avoid texts which are excessively short (and thus too difficult to attribute), we remove emails consisting of fewer than 15 words.

• Victorian. This dataset was created and made publicly available by Gungor [2018]. It consists of books by American or English 18th-19th century novelists, subdivided into segments of 1,000 words each by the creators of the dataset. They also (i) removed the first and last 500 words of each book, and (ii), as a topic-filtering measure, retained only the occurrences of the 10,000 words most frequent in the dataset. The result is a corpus of more than 50,000 documents (i.e., segments) by 50 different authors; the corpus is an imbalanced one, with the least represented author accounting for 183 segments and the most represented one accounting for about 4,000 of them. In order to divide it into a training set and a test set, we again perform a stratified split, including 70% of each author's texts in the training set and the remaining 30% in the test set. We use these documents as examples of literary production characterised by a sophisticated style.

• arXiv. This dataset, which we have created and made publicly available ourselves, consists of abstracts of single-author papers from arXiv. In order to limit domain-dependence we have harvested these abstracts by querying arXiv's API with a list of computer-science-related keywords, mostly focused on machine learning. Computer science articles are seldom written by a single author, which means that this dataset is not large. The corpus roughly follows a power-law distribution, with few prolific authors and many authors accounting for very few abstracts each: we retained authors with at least 10 abstracts to their name, resulting in a total of 1,469 documents from 100 authors. The 2 most prolific authors have 34 abstracts to their name, the 10 most prolific authors have written 22 or more, while 50% of the authors have no more than 12 abstracts to their name. In order to divide the corpus into a training set and a test set we performed a stratified split, with the production of each author being split into a training set (70% of the abstracts) and a test set (30%). We use these abstracts as examples of "scientific communication", characterised by a precise and compact style, with an abundance of technical terminology.

Learners
We use logistic regression (LR) as the learning method. LR is a simple linear model that has delivered very good accuracy in a number of text-related applications. LR has two further advantages, i.e., (i) the classification scores returned by the classifiers it trains are posterior probabilities, and (ii) these probabilities are well calibrated.¹⁵ These are important advantages, since the methods we have described in Sections 3.2.1 and 3.2.2 do rely on posterior probabilities, and obviously benefit from the fact that these posteriors are of high quality.
We optimise the hyperparameter C of LR (the inverse of the L2 regularisation strength) in the log-space {10^i : i = 0, ..., 4}, and select the value of C that minimises the multinomial loss in a stratified k-fold cross-validation (with k = 5).¹⁶

In order to generate the Same and Different training pairs, we adopt the following policy. Given a training set L, we first compute the number of Same pairs that can be generated. If there are fewer than 50,000 Same pairs, we generate them all; otherwise, we draw (uniformly at random) and generate 50,000 Same pairs. We then draw (again, uniformly at random) and generate as many Different pairs as the Same pairs we have generated. This is in order to guarantee a balanced training set, since there are usually many more potential Different pairs than Same ones.
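With scikit-learn, the hyperparameter search described above might be set up as follows (a sketch; the exact configuration used for the experiments may differ in its details):

```python
from sklearn.linear_model import LogisticRegressionCV

# C explored in {10^0, ..., 10^4}; the value minimising the multinomial
# log-loss in a stratified 5-fold cross-validation is selected
learner = LogisticRegressionCV(Cs=[1, 10, 100, 1_000, 10_000],
                               cv=5,
                               scoring='neg_log_loss',
                               max_iter=1_000)
# learner.fit(train_vectors, train_labels)
```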

Features
As for the choice of features, we stick to ones well-known and broadly adopted in the field of authorship analysis, i.e., features of a frequentistic nature that can be extracted automatically and that are believed to convey stylistic information; see for example [Eder 2011; Juola 2006; Stamatatos  2009] for an overview, and Kestemont et al. [2019, 2018] for a discussion of the most frequently used features in recent shared tasks focused on authorship analysis.
The features we use can be naturally subdivided into two groups. Group 1 is composed of

• Function words. Each function word that appears in the training set is a feature in our vectorial representations. We use the list of English function words provided by NLTK.¹⁷

while Group 2 is composed of

• POS n-grams. We extract parts of speech from our texts by using the Spacy library,¹⁸ and we consider each POS n-gram (for n ∈ [3, 4]) that occurs in the training set as a potential feature (where "potential" means "barring feature selection"; see below).
• Word uni-grams. We consider each word that occurs in the training set as a potential feature.
• Character n-grams. We consider each character n-gram (for n ∈ [2, 5]) that occurs in the training set as a potential feature.

¹⁵ A well-calibrated classifier is one that returns accurate posterior probabilities. An intuition of what "accurate posterior probabilities" means can be provided by the following example. If 10% (resp., 90%) of all the documents d_i for which h(d_i) = Pr(c|d_i) = 0.5 indeed belong to class c, we can say that the classifier h has overestimated (resp., underestimated) the probability that these documents belong to c, and that their posteriors are thus inaccurate. Conversely, if this percentage is 50%, we can say that the classifier h has correctly estimated the probability that these documents belong to c, and that their posteriors are thus accurate. Indeed, we say (see for instance [Flach 2017]) that the posteriors h(d_i) = Pr(c|d_i) are perfectly calibrated (i.e., accurate) with respect to a labelled set of documents when, for any value p, the fraction of the documents with posterior p that indeed belong to c is exactly p. The classifiers trained by means of some learners (and logistic regression is one of them) are known to return reasonably well-calibrated probabilities. Those trained by means of some other learners (such as Naïve Bayes) return probabilities which are known to be not well calibrated [Domingos and Pazzani 1996]. Yet other learners (such as SVMs or AdaBoost) train classifiers that return confidence scores that are not probabilities (i.e., that do not range on [0,1] and/or that do not sum up to 1). In order to address these two latter cases, probability calibration mechanisms exist (see e.g., [Niculescu-Mizil and Caruana 2005a,b; Platt 2000; Wu et al. 2004; Zadrozny and Elkan 2002]) that convert the outputs of these classifiers into well-calibrated probabilities.
¹⁶ We use scikit-learn's LogisticRegressionCV implementation; see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html.
¹⁷ https://www.nltk.org/
The features in Group 1 (i) are relatively few (typically: O(10²)), and (ii) are dense, i.e., all of them can be expected to occur to some degree in most texts. Given one of these features and given a document, as the value of the feature in the document we take its relative frequency in the document; for instance, the value of the punctuation symbol "!" in a document will be the number of times the symbol "!" occurs in the document divided by the number of punctuation symbols in the document. We also apply standardisation to the columns that these features generate in the document-by-feature matrix.¹⁹

The features in Group 2 are many (typically: O(10⁴) or O(10⁵)). In order to deal with the fact that they may be too many, we apply to them filter-style feature selection, using the chi-square test as the term scoring function [Yang and Pedersen 1997] and retaining the 50,000 highest-scoring features. As the feature weighting function, rather than using plain relative frequency we use tfidf (an increasing function of relative frequency) in its standard "ltc" variant (see e.g., [Salton and Buckley 1988]).²⁰ The features in Group 2 are sparse, i.e., in a given document a large number of them will not occur; we do not apply any standardisation to the features in Group 2, since this would turn them into dense features, and this would be detrimental to efficiency.
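A rough sketch of this feature-extraction pipeline (ours, with scikit-learn; only character n-grams are shown for Group 2, and the "ltc" weighting is approximated with sublinear tf-idf):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler

def extract_features(train_texts, train_authors, dense_group1):
    """dense_group1: dense Group-1 features (e.g., relative frequencies of
    function words), one row per training document."""
    # Group 2 (sparse): character n-grams with n in [2,5], tf-idf weighted;
    # word unigrams and POS n-grams would be built and stacked analogously
    vec = TfidfVectorizer(analyzer='char', ngram_range=(2, 5), sublinear_tf=True)
    X_sparse = vec.fit_transform(train_texts)

    # filter-style feature selection: keep the 50,000 features with the
    # highest chi-square scores (or all of them, if there are fewer)
    selector = SelectKBest(chi2, k=min(50_000, X_sparse.shape[1]))
    X_sparse = selector.fit_transform(X_sparse, train_authors)

    # Group 1 (dense): standardise each column (z-scoring)
    X_dense = StandardScaler().fit_transform(np.asarray(dense_group1))

    return hstack([csr_matrix(X_dense), X_sparse])
```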

Intrinsic evaluation of Diff-Vectors
Our "intrinsic" evaluation of DVs consists of SAV experiments, since SAV is the task that a classifier using DVs can solve directly.In these experiments we use a set of authors A = { 1 , . . .,   }, with  > 2, each one being the author of  documents.Given a dataset that contains a test set U, we test our systems on randomly drawn samples of test document pairs.The reason why we do not test on all possible pairs is (see also Section 2.3) a practical one, i.e., the fact that the number |U|(|U| − 1)/2 of all possible pairs is too high for all but the most trivial datasets.We randomly draw balanced subsets of 1,000 test pairs (500 positive and 500 negative) for each experiment.
We investigate the impact on performance of the number m of authors and the number of training documents per author. Specifically, for the IMDB62, PAN2011, and Victorian datasets we run experiments varying the number m of authors in the set {5, 10, 15, 20, 25} and the number of documents per author in the set {10, 20, 30, 40, 50}. Samplings are incremental, i.e., we do not resample from scratch; in other words, when moving from, say, 20 to 30 documents per author, we add 10 new documents per author to the previous 20. Regarding the test set, for each choice of m we draw 2,000 random test pairs, 1,000 of which consist of texts written by some among the m authors present in the training set (closed-set SAV), while the other 1,000 pairs consist of texts written by an equal number of authors other than those present in the training set (open-set SAV).

In order to compensate for the random effect introduced by sampling (authors, documents, and test pairs), we report results obtained by averaging across 10 runs for each combination of dataset, number of authors, and number of documents per author; we use the same random samples for all the methods we compare. The only exception is the arXiv dataset, which, due to its limited size, does not allow this extraction of multiple samples; hence, for this dataset we simply report experiments across 10 random train/test splits of the entire dataset.

¹⁸ https://spacy.io/
¹⁹ Standardisation (aka z-scoring) is a normalisation process consisting of centring and scaling a random variable so as to force its distribution to have 0 mean and unit variance, i.e., the z-score of a raw variable x is defined as z = (x − μ)/σ, where μ and σ are the (sample) mean and (sample) standard deviation of x as estimated in the training set. For the benefits in accuracy deriving from standardising dense features, see [Moreo et al. 2018].
²⁰ Using tfidf (which is indeed an increasing function of relative frequency) for weighting sparse features is customary in authorship analysis (see e.g., [Ikae 2021; Koppel et al. 2009; Koppel and Winter 2014; Menta and Garcia-Serrano 2021]). This function is the combination of the tf factor, which is somehow akin to relative frequency, with the idf factor, which lends a higher weight to features that are rare in the training set (see [Salton and Buckley 1988] for details); in authorship analysis, the use of idf is justified by the fact that rare POS n-grams / word unigrams / character n-grams can be considered more indicative of style than common ones.
We perform experiments in both closed-set SAV (Section 4.4.1) and open-set SAV (Section 4.4.2) settings. We evaluate performance in terms of vanilla accuracy (the fraction of correctly classified pairs), which is a perfectly valid evaluation measure when the test set is balanced across the classes, as is the case here.
4.4.1 Experiments on closed-set SAV. In the closed-set scenario, the authors in the test set U are the same as in the training set. We here explore two variants of our method:

• DV-Bin: the binary classifier discussed in Section 3.1.
• DV-2xAA: a method that solves SAV by building on top of the Lazy AA method discussed in Section 3.2.1. In other words, this method first predicts, for both unlabelled documents, who the author of the document is, and then checks if the two predicted authors are the same author.
We consider the following baseline systems:

• STD-CosDist: This consists of a binary classifier trained to predict whether a pair belongs to Same or Different, where a pair of documents is represented by a vector of one feature only. The value of this feature is obtained by calculating the distance between the two documents, each represented by a "standard" vector, where the distance function is the cosine distance. (A minimal sketch of this baseline is given after this list.) We have also run experiments using the L1 or L2 distances in place of the cosine distance; we omit their results since cosine proved the best-performing option. The training set is transformed into pairs following the same policy as in DV-Bin (see Section 4.2). The classifier thus learns the distance threshold that best separates the Same pairs from the Different pairs.
• STD-2xAA: This consists of a single-label multiclass classifier that operates on standard vector representations and that, as in DV-2xAA, solves SAV by performing closed-set AA for both documents and then checking if the two predicted authors are the same.
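A minimal sketch of the STD-CosDist baseline (ours; X is a dense matrix of "standard" document vectors, while pairs and labels are the Same/Different pairs built as in Section 4.2) makes its simplicity apparent: each pair is reduced to a single feature, and LR essentially learns a threshold on it.

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.linear_model import LogisticRegression

def cosdist_features(X, pairs):
    """One-dimensional representation of document pairs: their cosine distance."""
    return np.array([[cosine(X[i], X[j])] for i, j in pairs])

# std_cosdist = LogisticRegression().fit(cosdist_features(X_train, pairs), labels)
```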
Both baselines are equipped with the same learner as our method, i.e., LR optimised by running the usual optimisation process for hyperparameter C.

The results clearly indicate that the DV-based variants perform well: of the two methods that achieve SAV by running AA on both documents (i.e., the DV-2xAA and STD-2xAA methods), the DV-based method is always better or much better than the standard vector-based method, and the same holds for the comparison between the two binary methods, DV-Bin and STD-CosDist. All algorithms obviously improve their performance as the number of documents per author increases, with the sole exception of STD-CosDist. This latter fact might indicate that the optimal distance threshold that STD-CosDist finds is fairly stable, and is well estimated even when using few training data. However, it seems clear from these results that distances alone do not carry as much information as DVs do.
Figure 3 shows the distribution of the Pr(Same|d′, d″) values that STD-CosDist and DV-Bin compute for Same and Different pairs. For this experiment we have set the number of authors to m = 20 and the number of documents per author to 50 for all datasets except arXiv, where we have set m = 50 and used all the documents written by the 50 authors. (Note that the "2xAA" variants do not compute a single posterior probability and are thus not amenable to a similar analysis.) The STD-CosDist method manages to separate the posteriors of the Same and Different pairs to some extent in the IMDB62 and Victorian datasets, but it fails to separate them well in PAN2011 and arXiv. Interestingly enough, the posteriors generated by STD-CosDist are close to being normally distributed, both for the Same pairs and for the Different pairs. Things are very different for the DV-Bin method, which tends to generate much more polarised scores (i.e., to separate the positives from the negatives much better), placing most of the density mass around 0 for Different pairs and around 1 for Same pairs, which is indicative of a very good performance. Still, the score distributions generated for Victorian and, especially, for PAN2011 reveal that the DV-Bin method still has room for improvement.

4.4.2 Experiments on open-set SAV. In the open-set SAV experiments, there is no intersection between the set of authors that we draw to compose the test set and the set of authors observed during training. This aspect automatically rules out any attempt to perform SAV via authorship attribution (i.e., DV-2xAA); for this reason, in this setting the only DV-based method we test is DV-Bin. The baseline systems we consider are:

• STD-CosDist: This is the same distance-based method that we have used in the closed-set SAV experiments. In this case the method is constrained to learn the optimal threshold from authors different from those in the test set.
• Impostors: This is a method developed by Koppel and Winter [2014]. We use our own implementation of the "blogger's" variant, which had proved superior to others in the experiments of [Koppel and Winter 2014] and amounts to using documents from the same domain (blogs in the original authors' experiments, documents from the training set in our case) as the impostor candidates. We use cosine as the distance function, since in our experiments we have found it to consistently deliver better results than the "minmax" criterion (the similarity function of choice in [Koppel and Winter 2014]). We set the number of impostor candidates to 50 instead of 250 (which was found to work well by Koppel and Winter [2014]), since our training sets are much smaller than those they considered (sticking to 250 candidates would basically result in a random choice of impostor candidates); following [Koppel and Winter 2014], the rest of the parameter values we use are 10 impostors and 100 bagging trials. Also following [Koppel and Winter 2014], we optimise the decision threshold on a validation set. Note that we have not used this baseline method in the closed-set SAV experiments, since in that case the "impostors" cannot be created.

Figure 4 displays the experimental results we have obtained on IMDB62, PAN2011, and Victorian, while Table 2 reports the results obtained for the arXiv dataset.

Table 2. Intrinsic evaluation of DVs: results on open-set SAV, using vanilla accuracy as the evaluation measure on dataset arXiv. The notational conventions are the same as in Table 1.

                  mean     std    t-test
  DV-Bin          .663    1.966
  STD-CosDist     .661    1.891     **
  Impostors       .642    2.473     **
There is no clear winner in the light of these results. DV-Bin seems to perform best in IMDB62, especially when the number of authors increases; all methods seem to perform comparably in PAN2011 and arXiv, and STD-CosDist seems to perform slightly better in Victorian. Somewhat surprisingly, the Impostors method seems not to take advantage of the increase in the number of documents per author, likely because the number of actual impostors (10) is set in advance, and the method is thus indifferent to variations in the number of documents per author. DV-Bin tends to perform poorly when the number of documents per author is very small (i.e., 10); this may be explained by the fact that the number of Same pairs that can be generated from 10 elements is relatively small. Concerning STD-CosDist, it proves a fairly stable method, as in the closed-set scenario. PAN2011 proves the hardest dataset here, with all methods performing only marginally better than a random classifier (which would obtain an expected accuracy of 0.50). Regarding the arXiv dataset, DV-Bin performs best on average, but the t-test reveals that this superiority is not significant from a statistical point of view.
Summing up, there is not strong enough empirical evidence to claim that the DV-Bin method outperforms Impostors in open-set SAV. However, there are some technical reasons why one should prefer the DV-Bin method to the Impostors method. The first concerns efficiency. Impostors is a lazy method, meaning that it has no offline training phase, i.e., all inductive inference is carried out in the classification phase, and the workload that a single test pair entails is significant, since it involves computing the similarity between the test document and each training document, and computing the 100 rounds of bagging for each impostor and for each element in the pair. Conversely, once trained, classifying an unlabelled pair using Diff-Vectors comes down to computing a simple linear combination of feature differences.²² The second reason concerns applicability. By definition, the Impostors method cannot be used, as observed above, in closed-set SAV and, more generally, in SAV settings in which documents written by any of the authors of the test pair are observed in training. The reason is that the method would likely consider training documents by one of the test authors as candidate impostors (since these training documents are expected to be more similar to the test document), and thus the test author could wrongly be taken for an impostor of herself.

²² Of course, it is fair to mention that the Impostors method incurs no cost for training. But this only applies if the value of the decision threshold is hard-wired. In practice, the optimal threshold has to be estimated in a validation phase, which amounts to using a training set to perform repeated rounds of tests which, as indicated above, require a considerable computational effort.
Figure 5 shows the distribution of the decision scores (i.e., the posteriors Pr(Same|d′, d″)) that Impostors and Diff-Vectors compute for Same pairs and Different pairs. As for closed-set SAV, we set the number of authors to 20 and the number of documents per author to 50 for all datasets except arXiv, for which we instead set the number of authors to 50 and keep all documents per author. Recall that, in our open-set setting, this number specifies both the number of authors involved in the training set and the number of authors involved in the test set (e.g., in the case of arXiv, we are using the entire dataset, since there are 100 distinct authors). For ease of visualisation, we report the score values on a logarithmic scale.
The Impostors method produces decision scores which tend to be very close to 0. The dashed vertical line indicates the decision threshold found optimal in the validation phase; this threshold is 0.005 in all cases except for arXiv, where 0.01 worked better. Note that this threshold succeeds in placing most of the negative scores below it, but still misclassifies many positives. In particular, in PAN2011 and arXiv it fails to push many of the positive scores beyond the decision threshold.
The DV-Bin method instead succeeds at polarising the decision scores of Same and Different pairs in IMDB62 and arXiv, although it fails to allocate most of the negative mass below the 0.5 threshold in Victorian and, to a greater extent, in PAN2011.
Overall, as is clear from a simple visual inspection, the DV-Bin method is better than the Impostors method at correctly separating the scores of the Same pairs from those of the Different pairs on each of our four datasets.

Extrinsic evaluation of Diff-Vectors
Our "extrinsic" evaluation of DVs consists of closed-set AA experiments.We do not run experiments for AV since, as discussed in Section 3.3, each of our AA experiments is also a set of  AV experiments, and be evaluated as such.
4.5.1 The AA results.At the core of our AA methods is a SAV classifier that operates on pairs of documents.Given a test document , attribution for it is performed by applying a combination rule to the posterior probabilities generated for pairs of documents consisting of the test document  and a training document  ′ .In particular, we explore: • Lazy AA: the lazy combination rule inspired by -NN discussed in Section 3.2.1.
• Stacked AA: the linear combination rule inspired by stacked generalisation discussed in Section 3.2.2 (a minimal sketch of both rules follows this list).
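The sketch below gives one plausible rendering of the two rules (our own, written in Python); the helper sav_posterior(d, d_i), returning Pr(Same|d, d_i) as computed by a trained DV-Bin classifier, and the exact way the top-k posteriors are aggregated in Lazy AA are assumptions, since the precise formulations are those of Section 3.2:

```python
# Hedged sketch of the two combination rules built on top of a SAV posterior.
import numpy as np

def lazy_aa(test_doc, train_docs, train_authors, sav_posterior, k=5):
    """k-NN-style rule: score each author by the k training documents of hers
    that look most 'Same-like' with respect to the test document."""
    scores = {}
    for author in set(train_authors):
        posts = sorted((sav_posterior(test_doc, d)
                        for d, a in zip(train_docs, train_authors) if a == author),
                       reverse=True)
        scores[author] = float(np.mean(posts[:k]))
    return max(scores, key=scores.get)

def stacked_aa_features(doc, train_docs, sav_posterior):
    """Stacked AA: represent a document as the vector of SAV posteriors with
    respect to every training document; a metaclassifier is then trained on
    these vectors (and invoked on them at classification time)."""
    return np.array([sav_posterior(doc, d) for d in train_docs])
```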
In these experiments we consider a set A of authors, with |A| > 2, each having m training documents. Given a test set U, the method is asked to attribute each test document to one of the authors in A, in a single-label multiclass fashion. We investigate the impact on AA accuracy of the number A of authors and of the number m of documents per author. We let A take values in {5, 10, 15, 20, 25}, as before, and we let m take values in {5, 10, ..., 45, 50}. As in Section 4.4, and for analogous reasons, the experiments are different for the arXiv dataset, in which the above fine-grained exploration is not possible. For both Lazy AA and Stacked AA, we use DV-Bin as the underlying SAV mechanism.
In this case, instead of vanilla accuracy we use F1 as the evaluation measure, since not all our datasets are balanced, and since vanilla accuracy, unlike F1, is a notoriously bad measure for working with imbalanced datasets. For all datasets we report the values of macro-averaged F1, i.e., F1 averaged across the authors in A; in the case of the arXiv dataset we also report micro-averaged F1 (i.e., F1 as obtained on a global contingency table generated by all the classification predictions for all authors), since this is the only imbalanced dataset of the lot (and since micro-averages would coincide with macro-averages in the perfectly balanced datasets IMDB62, PAN2011, and Victorian). All results are reported as averages across 10 runs that use different random seeds.
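As a reference for how these two averages differ, here is a minimal, made-up example of computing them with scikit-learn; the actual evaluation code used in the experiments is not shown in the paper, so this is only illustrative:

```python
from sklearn.metrics import f1_score

# Toy author labels for six test documents (purely illustrative values).
y_true = ["a1", "a1", "a2", "a3", "a3", "a3"]
y_pred = ["a1", "a2", "a2", "a3", "a3", "a1"]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # F1 per author, then averaged
micro_f1 = f1_score(y_true, y_pred, average="micro")  # from the global contingency table
```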
As our baseline we consider STD-AA, a single-label multiclass classifier trained to distinguish among the candidate authors from the observation of "standard" vectors of features. Given a test document, the classifier returns the author obtaining the maximum posterior probability.
Figure 6 displays the experimental results we have obtained for the IMDB62, PAN2011, and Victorian datasets (for the moment, let us disregard the curves for STD-Bin, on which we comment later), while the first three rows of Table 3 report the results obtained on the arXiv dataset.
These results show the drastic superiority of DVs over standard vectors for AA. Only for m = 5, and infrequently for m = 10, does STD achieve (marginally) better results than the DV-based variants; in these cases, the reason might be that low values of m result in fewer Same pairs (e.g., for m = 5 there are only 10 unordered pairs per author), which might lead to suboptimal accuracy for the underlying SAV methods. Regarding our variants, the k-NN-inspired combination rule consistently outperforms the linear one in IMDB62 and arXiv, and is slightly better than or comparable to it in the remaining cases. All methods understandably benefit from an increase in m, but DVs seem to do so at a much greater rate; indeed, the increase in the number of training examples is quadratic in m for the DV-based variants, while it is linear in m for STD.

4.5.2 The AV results. Concerning the AV task, note that macro-averaged F1 is also the right measure for evaluating AV; in fact, F1 as measured on a specific author is the right measure for evaluating AV once that author is taken as the candidate author, and macro-averaged F1 is the right measure for computing the average performance across all possible choices of the candidate author. As a consequence, the results reported in Figure 6 and Table 3 also count as an evaluation of the reported methods for the AV task.
For AV, we add a further baseline (which we call STD-Bin), consisting of a binary classifier trained to distinguish the documents of the candidate author from those of all other authors from the observation of "standard" vectors of features; it is fair to add this baseline since it would be natural to solve AV by means of a binary classifier, rather than by means of a multiclass classifier as STD-AA does.
However, the experimental results show STD-Bin to be inferior to STD-AA, as is clear from both Figure 6 and Table 3. This is in keeping with the results of our preliminary experiments (discussed in Section 3.3) that had convinced us to abandon the idea of performing AV via Lazy AV and Stacked AV, in favour of versions of Lazy AA and Stacked AA in which we attribute document d to the candidate author if the AA algorithm does so, and deem d as not written by the candidate author if the AA algorithm attributes it to a different author. Concerning the likely reasons why this happens, the same considerations we made in Section 3.3 apply. In sum, given that STD-Bin is not a serious contender, the same considerations on the superiority of DV-based methods over standard methods that we made in Section 4.5.1 for AA also apply to AV.

Efficiency
The improvements in performance obtained by DV-based methods with respect to methods based on standard vectorial representations can be attributed to the increase in the number of training examples resulting from pairing documents. However, this can be expected to come at a computational cost. In this section we compare the actual cost of DV-based methods with that of methods based on standard representations.
4.6.1 Efficiency analysis. Let n = |L| be the number of training documents. Let us also assume that the total cost of training a classifier is bounded by some function f of the number n of training documents, a cost which depends on the learning algorithm and its implementation; in other words, this total cost is O(f(n)), where we can safely assume f(n) to grow at least as fast as n. (We take the number of features to be constant, which means that it does not impact our analysis of efficiency.) Let us also assume that the classification of a document requires constant time, i.e., is O(1).
The cost of training the DV-Bin classifier of Section 3.1 comes down to the cost of generating the n(n−1)/2 pairs, which is O(n²), plus the cost of training a classifier on n(n−1)/2 DVs, which is O(f(n²)). In practice, and in order to keep the computational burden within reasonable bounds, we only generate a fixed number of examples (i.e., we avoid generating all pairs first and discarding some of them later). Let n = A·m, with A the number of authors and m the number of documents per author, as before; we generate all A·m(m−1)/2 pairs of type Same and as many pairs of type Different, thus ending up with A·m(m−1) training examples, which has a cost O(A·m²) = O(n·m), plus, again, the cost of training the classifier from these examples, which is O(f(n·m)); since we have assumed f(n) to grow at least as fast as n, the total cost of generating a DV-Bin classifier is O(f(n·m)). At classification time we only need to compute the absolute difference between two vectors and invoke the classifier; for most classifiers (and for LR in particular) this cost can be considered constant, i.e., O(1).
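A minimal sketch of this generation strategy (our own; the paper's actual sampling of Different pairs may differ in its details) produces all Same pairs plus an equal number of Different pairs, sampled directly rather than generated and then discarded:

```python
import random
from itertools import combinations

def generate_balanced_pairs(doc_ids_by_author, seed=0):
    """All Same pairs, plus an equal number of Different pairs sampled directly,
    so that the full set of n(n-1)/2 pairs is never materialised."""
    rng = random.Random(seed)
    same = [(d1, d2) for docs in doc_ids_by_author.values()
            for d1, d2 in combinations(docs, 2)]
    authors = list(doc_ids_by_author)
    different = set()
    while len(different) < len(same):
        a1, a2 = rng.sample(authors, 2)
        different.add((rng.choice(doc_ids_by_author[a1]),
                       rng.choice(doc_ids_by_author[a2])))
    return same, list(different)
```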
As a lazy algorithm, Lazy AA does not involve any real training phase. However, it seeks the optimal value of k, and this entails pre-computing a matrix of distances, which is done only once (and is O(n²)), plus sorting, for each of the A authors and for each of the n training documents (see Equation 8), the m training documents by this author (which is O(m log m)). Altogether, this entails a total cost of O(n² + n·A·m log m) = O(n² log m). At classification time (for both the AA and the AV settings), we only need to sort, for each of the A training authors, the m training documents by this author, which means that this is O(A·m log m) = O(n log m).
Concerning Stacked AA, training the system entails (i) training a DV-Bin classifier, which, as argued above, has a cost O(f(n·m)); (ii) creating the vector of SAV posteriors for each of the n training documents, which has a cost O(n²) (since creating one such vector has a cost O(n)); and (iii) training the metaclassifier on the n vectors thus generated, which has a cost O(f(n)); the total cost of training the system is thus the larger of O(n²) and O(f(n·m)). At classification time we need to generate the vector of SAV posteriors for the test document, which has a cost O(n), and to invoke the metaclassifier, which we can assume to require constant time.
The Impostors method does not properly carry out a training phase, but incurs the cost of optimising the decision threshold, which consists of carrying out a user-defined number of rounds of tests on the training set. The computational cost of testing whether two documents have been written by the same author or not entails computing, for each of the n training instances, the similarity with each test document, which is O(n), plus sorting by similarity in order to choose the "impostors", which is O(n log n); this means that the total cost is O(n log n). Impostors then performs a number of rounds of bagging trials with respect to each of the selected impostors, which adds a further cost linear in the number of impostors times the number of rounds, if we assume the similarity function to be computed in constant time.
4.6.2 Timings. As for the experiments reported in Figures 3 and 5, we report actual timings clocked for A = 20 and m = 50 in the case of the IMDB62, PAN2011, and Victorian datasets, and for the entire dataset in the case of arXiv. The variables that influence the analysis include the number of training documents (|L|), the number of pairs generated by DVs (|L_P|), and the number of test documents (|U|). Recall that |L_P| depends on the number of Same pairs that can be generated, which is fixed and amounts to 20·(50·49)/2 = 24,500 for IMDB62, PAN2011, and Victorian, and which is variable and depends on the random split (we report the value averaged across 10 runs) for arXiv. The values are summarised in Table 5 for convenience. Recall also that the number of test pairs in SAV tasks is fixed for all datasets and is equal to 1,000. Note that the arXiv dataset is split differently for SAV and AA since, although we used the entire dataset in both tasks, in the former we held half of the authors out for composing the open set. All times refer to computations carried out on the same machine, equipped with a 12-core Intel Core i7-4930K processor at 3.40GHz and 32 GB of RAM, running Ubuntu 18.04. All methods run on CPU and are implemented using scikit-learn and the SciPy stack. We have parallelised all parallelisable steps, both in training and in testing, for all algorithms.
Table 6 reports the average time each method requires to complete the SAV task, in terms of both training time and testing time, for each dataset. The method that uses standard vectors to compute the cosine distance (STD-CosDist) is much faster than any competing method, both in training and in testing. This is due to the fact that the cosine can be computed very quickly, and that the classifier operates on one single feature. At training time, both DV-Bin and Impostors are computationally much more expensive, with neither one being clearly better than the other. However, at classification time DV-Bin is much faster than Impostors, requiring no more than a few seconds to accomplish the 1,000 SAV computations, comparably to STD-CosDist. Impostors, on the contrary, requires much more time, and its testing times are higher than its training times. (Recall that, for the Impostors method, by "training" we mean the search for the optimal value of the decision threshold by using the training set, since Impostors does not properly perform any training.) Table 7 reports the average time each method requires to complete the AA task. It is immediately evident that, at least on IMDB62, PAN2011, and Victorian, the two most expensive methods are the ones based on DVs, both at training time and at classification time (neither one is systematically better or worse than the other, though); the STD method is thus almost always the fastest. The reason for the high computational cost of the DV-based methods is that, despite the fact that DV-Bin proved very fast at classification time in SAV, Lazy AA and Stacked AA invoke DV-Bin many times, i.e., they require computing, for all training documents (in the training phase) and for all test documents (in the testing phase), the similarity (viewed as a posterior probability computed by DV-Bin) with each training document. This has an important impact on both the training phase and the testing phase.
More recently, [Ikae 2021; Menta and Garcia-Serrano 2021; Weerasinghe et al. 2021] tested the use of DVs for the open-set SAV problem at the recent PAN2021 shared task; in particular, Menta and Garcia-Serrano [2021] propose a method that feeds DVs to a double-channel neural network, where the feature values are the tf-idf weights of character n-grams in one channel, and of punctuation marks in the other channel. The outputs of the two channels are then concatenated in a final series of layers that ultimately leads to the classification decision.
Finally, we note that the DV-based representations we have discussed are reminiscent of ideas that have been independently explored in multilingual text classification. In particular, Moreo et al. [2016] investigate the idea of applying lightweight random projections to the feature space. Mathematically, a random projection XR of a matrix X ∈ R^{n×D}, with n the number of documents and D the number of features, is attained by multiplying it by a random matrix R ∈ R^{D×d}, with d ≪ D the number of projected dimensions. The term "lightweight" refers to the fact that the rows of R contain only two non-zero values (−1 and +1). The pair-based version L_P of a dataset L can be defined in terms of |P·X|, where X ∈ R^{n×D} is our document-by-feature matrix and P is instead a lightweight projection matrix, this time of shape n_P × n, with n_P, the number of pairs, much higher than n; here |·| represents the element-wise absolute value. Such a projection effectively computes the absolute difference between two chosen documents.
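As an illustration (our own sketch, not code from Moreo et al. [2016]), the pair-based dataset can indeed be obtained by left-multiplying the document-by-feature matrix by a sparse matrix with exactly one +1 and one −1 per row, and then taking element-wise absolute values:

```python
import numpy as np
from scipy.sparse import csr_matrix

def pair_projection_matrix(pairs, n_docs):
    """P has one row per pair (i, j), with +1 in column i and -1 in column j,
    so that |P @ X| stacks the Diff-Vectors |x_i - x_j| row by row."""
    rows = np.repeat(np.arange(len(pairs)), 2)
    cols = np.array([idx for pair in pairs for idx in pair])
    vals = np.tile([1.0, -1.0], len(pairs))
    return csr_matrix((vals, (rows, cols)), shape=(len(pairs), n_docs))

# Toy check: 4 documents with 3 features each, and DVs for the pairs (0,1) and (2,3).
X = np.array([[1., 0., 2.], [0., 1., 2.], [3., 1., 0.], [1., 1., 1.]])
P = pair_projection_matrix([(0, 1), (2, 3)], n_docs=4)
dvs = np.abs(P @ X)   # row 0 = |x_0 - x_1|, row 1 = |x_2 - x_3|
```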

CONCLUSION
In this paper we have discussed the implications of the use of Diff-Vectors (DVs) in authorship identification tasks. A DV is a vector that represents a pair of documents in such a way that the value of a feature in the DV is the absolute difference between the relative frequencies (or increasing functions thereof) of the feature in the two documents. DVs were originally introduced by Koppel and Winter [2014], but in that very same work these authors dismissed DVs as a "simplistic baseline method". Neither Koppel and Winter [2014] nor other authors studied the implications of the use of DVs in authorship identification; a systematic study of these implications is what this paper describes.
DVs are naturally geared towards solving the "same-author verification" (SAV) task, i.e., the binary task of deciding whether two documents have been written by the Same (possibly unknown) author or by Different authors. However, we have shown that both (i) (closed-set) authorship attribution (the task of predicting who among a given set of candidates is the true author of a given text) and (ii) authorship verification (the task of predicting whether a given author is or is not the author of a given text) can be recast in terms of SAV; we have presented two original algorithms (Lazy AA and Stacked AA) that do this for both AA and AV.
In order to compare DV-based authorship identification methods with their counterparts based on "standard" vectors, we have carried out experiments on four datasets of texts labelled by author (one of which we created ourselves and make publicly available here for the first time), representative of different textual genres, lengths, and styles, and on three authorship identification tasks (SAV, AA, AV). Our experiments have shown that DV-based methods are particularly well suited to some authorship identification tasks but not to others. For instance, the results indicate that neither standard methods nor DV-based methods clearly outperform the other on open-set SAV (see Section 4.4.2). Instead, DV-based methods vastly outperform the competition on three important tasks, i.e., (a) closed-set SAV (see Section 4.4.1), (b) closed-set AA (see Section 4.5), and (c) AV (see Section 4.5). As we have argued, these benefits derive from the fact that, in many cases, DV-based methods can exploit more training data than methods based on standard vectors (see Section 2.3), and that DVs may make training more robust even when this is not the case (see Section 2.4).
In future work we would like to study "diff-functions" other than the absolute difference of (a static, fixed increasing function of) the feature frequencies of the two documents, by testing the possibility of dynamically learning such functions from data, in the style of [Moreo et al. 2020]. Other aspects worth exploring include testing DVs in author profiling tasks, such as native language identification.

Fig. 1. 3-dimensional example of the surface (in green) that (ideally) separates the region of DVs belonging to Same (which corresponds to the tetrahedron comprised between the separating surface and the origin of the axes) from the region of DVs belonging to Different, in the linear case. When the number of features is D, the tetrahedron becomes a D-simplex and the separating surface is a (D−1)-simplex.

Example 2.1. Assume a set of 10 authors and a training set consisting of 100 training examples for each author. The DV-based representation gives rise to (1,000·999)/2 = 499,500 DVs. Among these, there are 45 unordered pairs (a′, a″) of different authors, and for each such pair there are 100·100 = 10,000 pairs of examples in which one example is by a′ and the other is by a″; of these, (a) 9·(100·100) = 90,000 are such that one of a′ and a″ is the author of interest a*, and (b) 36·(100·100) = 360,000 are such that neither of a′ and a″ is a*.

• Word lengths. Each word length instantiated in the training set is a feature.
• Sentence lengths. Each sentence length instantiated in the training set is a feature.
• Punctuation symbols. Each punctuation symbol that occurs in the training set is a feature.

Figure 2 reports the experimental results we have obtained, displayed in terms of accuracy (on the y axis) as a function of the number of training documents per author (on the x axis), for datasets IMDB62, PAN2011, and Victorian (each corresponding to a different column) and for varying numbers of authors (each corresponding to a different row). The values for the combination (PAN2011, 25, 50) are missing since this combination is not feasible, given that in PAN2011 there are fewer than 50 authors (25 for the closed-set setting and 25 for the open-set setting) with at least 50 training documents each. Coloured dots represent average results across 10 experiments, while the colour-band frontiers indicate ± one standard deviation from the mean.

Fig. 2. Intrinsic evaluation of DVs: results on closed-set SAV, using vanilla accuracy (on the y axis) as the evaluation measure, on datasets IMDB62, PAN2011, and Victorian.

Fig. 4. Intrinsic evaluation of DVs: results on open-set SAV, using vanilla accuracy (on the y axis) as the evaluation measure, on datasets IMDB62, PAN2011, and Victorian.

Fig. 5. Distribution of decision scores for positive and negative (i.e., Same and Different) pairs as computed by the Impostors method (1st column; note the log scale) and by the DV-Bin method (2nd column).

Fig. 6. Extrinsic evaluation of DVs: results on closed-set AA in terms of F1 for the IMDB62, PAN2011, and Victorian datasets.

Table 1 reports the results for the arXiv dataset.

Table 1. Intrinsic evaluation of DVs: results on closed-set SAV, using vanilla accuracy as the evaluation measure, on the arXiv dataset. Boldface indicates the best method. Symbols * and ** denote the method (if any) whose score is not statistically significantly different from the best one at a significance level of 0.05 (*) or 0.001 (**), according to a paired-sample, two-tailed t-test. No symbols * and ** appear in this particular table since all differences are statistically significant.

The top-performing method is unquestionably DV-2xAA, which always outperforms (often by a very large margin) all the others, for all numbers of authors and for all numbers of training examples per author. As for the reason why DV-2xAA outperforms DV-Bin, we conjecture that this may happen because the Lazy AA method uses only the evidence conveyed by a few relevant training documents (the k documents most similar to the test document, for both test documents), thus filtering out other, less relevant documents; this is in keeping with the fact that methods based on nearest neighbours, as our DV-2xAA method is, always pick, during their parameter optimisation phase, values of k that are much smaller than the size of the training set.

Table 3. Extrinsic evaluation of DVs: results on closed-set AA in terms of F1 for the arXiv dataset. The notational conventions are the same as for Table 1.

Table 4 summarises all the costs involved.

Table 4. Computational cost of a number of algorithms discussed in this paper.