Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit

Name disambiguation -- a fundamental problem in online academic systems -- now faces greater challenges with the rapid growth of research papers. For example, on AMiner, an online academic search platform, about 10% of names are shared by more than 100 authors. Such challenging real-world cases have not been effectively addressed by existing research due to the small scale or low quality of the datasets used. The development of effective algorithms is further hampered by the variety of tasks and evaluation protocols designed on top of diverse datasets. To this end, we present WhoIsWho, comprising: a large-scale benchmark with over 1,000,000 papers built via an interactive annotation process, a regular leaderboard with comprehensive tasks, and an easy-to-use toolkit encapsulating the entire pipeline as well as the most powerful features and baseline models for tackling the tasks. Our strong baseline has already been deployed in the AMiner system to enable daily arXiv paper assignment. The public leaderboard is available at http://whoiswho.biendata.xyz/. The toolkit is at https://github.com/THUDM/WhoIsWho. The online demo of daily arXiv paper assignment is at https://na-demo.aminer.cn/arxivpaper.


Figure 1: The sizes of the prevailing name disambiguation benchmarks. Among these, WhoIsWho is the largest, with 1,000+ names, 70,000+ authors, and 1,000,000+ papers.

INTRODUCTION
Name disambiguation, aiming to clarify who is who, is one of the fundamental problems in online academic systems such as Google Scholar, Semantic Scholar, and AMiner. The past decades have witnessed a huge proliferation of research papers in all fields of science. For example, Google Scholar, Bing Academic Search, and AMiner have each indexed about 300 million papers [10,33,36]. As a result, the author name ambiguity problem -- the same author appearing under different name variants, or different authors sharing the exact same name (homonyms) -- has become increasingly sophisticated in modern digital libraries. For example, as of January 2023, there were over 10,000 authors with the name "Yang Yang" on AMiner. Three of them are displayed in Figure 2. Since all three authors are computer scientists, there are intricate connections between their papers. Paper p5, which belongs to "Yang Yang (THU)", was mistakenly assigned to "Yang Yang (UND)", because both "Yang Yang"s coauthored with "Yizhou Sun", leading to the appearance of reliable co-author and co-keyword relationships between p5 and the correct paper p4 of "Yang Yang (UND)". Furthermore, "Yang Yang (THU)" and "Yang Yang (ZJU)" are the same person but are separated into two different authors due to an organization shift after graduation. This real-world example demonstrates the great challenges of name disambiguation in online academic systems, which, however, cannot be addressed by existing efforts [3,18,20,21,32,35,38,47-49] because of small-scale, low-quality benchmarks and non-uniform task designs and evaluation settings.
In particular, even though several name disambiguation benchmarks, such as PubMed [40,45], MAG [46], DBLP [14], etc. [17,37], have been directly harvested from existing digital libraries, the inevitable spurious information and assignment mistakes, as shown in Figure 2, are detrimental to building effective algorithms [4,44]. In light of this, others attempt to manually annotate a small amount of high-quality data from the noisy online data in order to reduce the negative impact of this noise [11,30,34]. However, as illustrated in Figure 1, the majority of them lack an adequate number of instances. Additionally, on top of these benchmarks, previous efforts have defined a variety of tasks and evaluation protocols, preventing us from fairly comparing different methods to promote the development of the name disambiguation community.

Present Work. We present WhoIsWho, a benchmark and a leaderboard, together with a toolkit, for web-scale academic name disambiguation. Specifically, WhoIsWho has the following characteristics:

• Interactive large-scale benchmark construction. To create a challenging benchmark, we devise an interactive annotation process to label paper-author affiliations under single names with high ambiguity, with the aid of a developed visualization tool. 10+ professional annotators were employed to conduct the annotation task, each of them spending about 24 working months. To date, we have released a large-scale, high-quality, and challenging benchmark that contains over 1,000 names, 70,000 authors, and 1,000,000 papers. Figure 1 shows the WhoIsWho benchmark is orders of magnitude larger than existing manually-labeled datasets.
• Contest leaderboard with comprehensive tasks. To fairly compare various name disambiguation methods, we sponsor contests with two tracks. The first is From-scratch Name Disambiguation (SND), which aims at grouping papers by the same author together in order to fulfill the need to create an academic system from scratch. The other is Real-time Name Disambiguation (RND), also known as incremental name disambiguation, which targets assigning newly-arrived papers to the existing clarified authors. The RND task is crucial for maintaining the regular assignment of papers on existing online academic systems that own a substantial number of clarified author profiles. Beyond these, we additionally define Incorrect Assignment Detection (IND), which attempts to remedy online paper-author affiliation errors in order to guarantee the reliability of academic systems. To date, three rounds of contests have been held on the first two tasks, attracting more than 3,000 researchers. Furthermore, we host a regular leaderboard to keep track of recent advances. The contest for the IND task is under active preparation.
• Easy-to-use toolkit. To help researchers quickly get started in the name disambiguation area, we summarize our research findings and organize an end-to-end pipeline that standardizes the entire name disambiguation process, including data loading, feature creation, model construction, and evaluation. We thoroughly investigate the contest-winning methods, assemble the most effective features and models, and encapsulate them into the toolkit. End users are free to directly invoke the baselines and encapsulated features to develop their own algorithms.
We provide in-depth analyses of the features adopted in the contest winners' methods, finding that blending multi-modal features, i.e., the semantic features involving paper attributes and the relational features created by co-author, co-organization, and co-venue links, contributes the most to the performance of name disambiguation methods. On top of these discoveries, we provide simple yet effective baselines (SND-all/RND-all) that perform on par with the top contest methods. In particular, RND-all has been deployed on AMiner for daily arXiv paper assignment.
To sum up, WhoIsWho is an ongoing, community-driven, open-source project. We intend to update the leaderboard as well as offer new datasets and methods over time. We also encourage contributions at oagwhoiswho@gmail.com.

WHOISWHO BENCHMARK
This section first introduces the interactive annotation process for constructing the large-scale high-quality benchmark and then presents the intrinsic distributions of the benchmark.

Interactive Benchmark Construction
We formalize the interactive benchmark construction pipeline into two sub-modules: data collection and data annotation.

2.1.1 Dataset Collection. Practically, we collect the raw data from AMiner [33]. To acquire name disambiguation data with less noise and higher ambiguity, we adopt the following rules.
Select authors by H-index. For each author in AMiner, we compute the H-index [12], a metric used to measure the impact of experts, and keep the authors with higher H-index scores. If authors are more well-known, it is assumed that their profiles contain less noise, because they may have already clarified themselves on the academic platform. Concretely, we sorted authors in descending order of H-index and filtered out those with an H-index less than 5. This threshold is a widely accepted criterion in the literature for identifying authors with significant impact in their research field.
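As a minimal sketch of this filtering step (the function names and the citation-list input format are ours for illustration, not the toolkit's API), the H-index computation and the threshold filter can be written as:

```python
def h_index(citations):
    """H-index: the largest h such that the author has h papers
    with at least h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(cites, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

def select_authors(author_citations, threshold=5):
    """Keep authors whose H-index meets the threshold from the text,
    sorted in descending order of H-index as in the collection pipeline."""
    kept = [(name, h_index(cites))
            for name, cites in author_citations.items()
            if h_index(cites) >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])
```

For example, an author with per-paper citation counts [10, 8, 5, 4, 3] has an H-index of 4 and would be kept under the threshold of 5 only if one more well-cited paper were added.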
Choose names with high ambiguity. We count the number of authors with the same name in AMiner. The term "same name" refers to name-blocking ways of unifying names, such as moving the last name to the front or preserving all name initials but the last name [2,14]. For example, the variants of "Jing Zhang" include "Zhang Jing", "J Zhang", and "Z Jing". A name is more ambiguous if it is used by more authors. We filter out names with fewer authors than a threshold to make WhoIsWho challenging.
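A toy sketch of the name-blocking rules described above for two-part names; the exact unification rules used by WhoIsWho may be more elaborate (e.g., handling middle names), so this is illustrative only:

```python
def name_variants(full_name):
    """Generate simple blocking variants of a two-part name:
    swapped order, first-name initial, and last-name initial."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    variants = {
        f"{last} {first}",      # move the last name to the front
        f"{first[0]} {last}",   # keep only the first-name initial
        f"{last[0]} {first}",   # keep only the last-name initial
    }
    variants.discard(full_name)
    return variants
```

Applied to "Jing Zhang", this reproduces the three variants given in the text: "Zhang Jing", "J Zhang", and "Z Jing".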
After obtaining highly ambiguous names and the corresponding authors for each name, we collect papers for each author. Specifically, we collect the title, author names, organizations of all authors, keywords, abstract, publication year, and venue (conference or journal) as the attributes of papers. Additionally, there are a large number of papers that have yet to be assigned to any author. To increase the challenge of the benchmark, we also gather these papers, denoted as unassigned papers, whose authors share the same names as those in the benchmark; they may be assigned to the authors in the benchmark during the data annotation pipeline.
2.1.2 Dataset Annotation. Figure 2 demonstrates some real-world hard cases of name disambiguation, which are quite challenging for annotators to label because of the intricate relationships between papers. In light of this, we design an interactive annotation tool adapted from [29] not only to provide detailed information about papers and authors but also to offer various practical atomic operations that help annotators perform arbitrary actions. A toy example is shown in Figure 12. The tool allows annotators to annotate interactively because each time an action is taken, the author profiles are updated and displayed to the annotators.
With the help of the tool, we establish four standardized annotation steps (detailed in Table 1) to ensure the manual labeling process is conducted in a reasonable manner. Overall, the annotators are authorized to remove incorrect papers, add unassigned papers, split an author into two authors, and merge two authors. Specifically, the first "Clean" step allows annotators to remove or split obviously incorrect papers from the concerned author; such papers cover different topics from the concerned author. Then, the "Validate" step allows annotators to perform the same "Clean" function on incorrect papers that are hard to identify; such papers cover topics relevant to the concerned author. After that, the "Add" step enables annotators to add unassigned papers to the associated authors. Finally, the "Merge" step allows annotators to blend the papers of two authors into a single author. Since the last three steps are more challenging than the first, three annotators are requested to annotate the same name, with their results aggregated by majority voting. Notably, annotators label all the papers of authors under the same name together each time. To prevent them from simply removing arbitrary papers, annotators must retain at least 80% of the papers for each author.
In summary, on one hand, the devised interactive annotation process, which provides abundant facts about papers, fully supports annotators in labeling the dataset effectively. On the other hand, each paper is examined by at least 10 skilled annotators, which further guarantees the quality of WhoIsWho.

Statistics of WhoIsWho Benchmark
We present a holistic analysis to demonstrate the superiority of the WhoIsWho benchmark from multiple facets, as illustrated in Figure 3.
Accuracy of the Annotated Authorship. We first check the accuracy of the manually-labeled authorship. To achieve this, we randomly sample 1,000 papers from the benchmark and manually verify which papers belong to which authors. Each paper is verified by three skilled annotators via majority voting. The resulting accuracy is 99.6%, with only four assignment errors, indicating that the benchmark offers a large number of high-quality instances.
Publication Date Distribution. Few papers are recorded before the year 2000, since managing digital libraries was still a relatively new technique at that time. As the internet developed rapidly after 2000, the number of digital records increased more quickly. However, there are fewer records around 2022 than around 2010, suggesting that the online name disambiguation system may not be able to assign the latest papers in time.
Author Position Distribution. Several datasets focus on disambiguating the author at a particular position in the paper. For example, Song-PubMed [30] is created for disambiguating the first author, which biases name disambiguation methods toward certain specific author positions. On the contrary, the WhoIsWho benchmark takes all author positions into account equally, as shown by the reasonable long-tail curve in Figure 3(b).
Name Ambiguity Distribution. Author names of different ethnic groups typically have varying degrees of ambiguity. Chinese authors, for example, are more difficult to disambiguate than those of other nationalities [9,15,16]. Figures 3(c) and 3(d) illustrate the distribution of clarified author profiles per Chinese and international name, respectively, in AMiner, indicating Chinese names are more ambiguous than international ones. As we focus on constructing a benchmark with high ambiguity that challenges name disambiguation methods, we collect more Chinese names, covering about 87% of the author names in our dataset, than international names.
Paper Number Distribution. We also present the distribution of the number of papers per author in the benchmark, as shown in Figure 3(e). The long-tail distribution indicates that most of the cases have a manageable quantity of papers and only a few famous scientists own hundreds of publications.
Domain Distribution. Compared with several datasets that merely cover biased domains -- for example, datasets based on PubMed [30,45] focus on the field of medical science -- WhoIsWho has great coverage of general disciplines. To confirm this, we randomly sample 100,000 papers and then adopt the taxonomy rank of SCImago Journal Rank (SJR) from Scopus to obtain paper domains. The top-10 highest-frequency domains are shown in Figure 3(f), which implies the benchmark not only covers a variety of domains but is also representative of the overall distribution in AMiner.

WHOISWHO TASKS & CONTESTS
In this section, we first present three name disambiguation tasks with standardized evaluation protocols. Then we review the three rounds of historical contests and the regular leaderboard built on the defined tasks with different released versions.

Task Formations and Evaluation Protocols
Here we formalize the three tasks, i.e., from-scratch name disambiguation, real-time name disambiguation, and incorrect assignment detection, with evaluation metrics, as shown in Figure 4.
Definition 1. Candidate Authors. Given a person name denoted by n, A_n = {a_1, . . . , a_M} is the set of candidate authors with the same name n. The term "same name" refers to the ways of unifying names using name-blocking techniques [2,14].
3.1.1 From-scratch Name Disambiguation. At the beginning of building a digital library, we need to partition a large number of published papers into groups, each of which represents the papers belonging to a single person. To achieve this, we formalize from-scratch name disambiguation as a clustering problem.
Problem 1. From-scratch Name Disambiguation (SND). Given a set of candidate papers P_n, SND aims at finding a function Φ to partition P_n into a set of disjoint clusters C_n, i.e., Φ(P_n) → C_n, where each cluster consists of the papers owned by the same author.
Evaluation Protocol. We adopt the macro pairwise-F1 to evaluate the performance of SND methods, which is widely adopted by many SND methods [18,28,31,47,49].
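The macro pairwise-F1 metric can be sketched as follows. This is the formulation commonly used in the cited SND literature (pairs of papers in the same cluster count as positive pairs); the leaderboard's exact implementation may differ in edge-case handling:

```python
from itertools import combinations

def pairwise_f1(pred_labels, true_labels):
    """Pairwise-F1 for one name block: a paper pair counts as
    positive if both papers fall in the same cluster."""
    idx = range(len(pred_labels))
    pred_pairs = {(i, j) for i, j in combinations(idx, 2)
                  if pred_labels[i] == pred_labels[j]}
    true_pairs = {(i, j) for i, j in combinations(idx, 2)
                  if true_labels[i] == true_labels[j]}
    if not pred_pairs or not true_pairs:
        return 0.0
    hit = len(pred_pairs & true_pairs)
    if hit == 0:
        return 0.0
    prec = hit / len(pred_pairs)
    rec = hit / len(true_pairs)
    return 2 * prec * rec / (prec + rec)

def macro_pairwise_f1(blocks):
    """Macro average over name blocks, each a (pred, true) label pair."""
    return sum(pairwise_f1(p, t) for p, t in blocks) / len(blocks)
```

The macro average treats every name block equally, regardless of how many papers it contains.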

3.1.2 Real-time Name Disambiguation.
Assigning new papers to existing authors is crucial for online digital libraries at the current stage. For instance, AMiner receives over 500,000 new papers each month. To this end, we formalize real-time name disambiguation as a classification problem.
Problem 2. Real-time Name Disambiguation (RND). Given a paper p_i, i.e., a paper with an author name n to be disambiguated, and the set of candidate authors A_n, the right author a* can be either a real author in A_n or a non-existing author profile, i.e., NIL. We target learning a function Ψ to assign the paper p_i to a*, i.e., Ψ(p_i, A_n) → a*.
Note that NIL situations occur frequently on online academic platforms. Assuming undergraduate students publish their first paper at a conference or journal but the current database has not yet established their author profiles, it is infeasible to assign the paper to any existing author. In light of this, we have incorporated NIL scenarios in the RND task. Former efforts [3] also take the NIL situation into account; however, they create synthesized NIL labels rather than incorporating actual NIL cases. To the best of our knowledge, we are the first to consider the NIL situation with manually-labeled real NIL cases in the WhoIsWho benchmark.
Evaluation Protocol. We propose the weighted-F1 to evaluate the methods that solve the RND problem. For an author a to be disambiguated, precision measures the correctness of the papers predicted to belong to a, and recall measures how many of a's actual papers are correctly assigned to a. We then calculate the F1 score from the precision and recall for each author. After that, we average the F1 scores, weighting each author by the percentage of their papers to be assigned. We adopt the weighted average strategy to alleviate the negative effects of extreme cases, such as authors who have only one paper.
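A hedged sketch of this weighted-F1 computation, assuming per-author counts of correctly assigned, predicted, and actual papers; the weighting by each author's share of papers follows the description above, while the input format is ours:

```python
def weighted_f1(per_author):
    """per_author: dict mapping author -> (n_correct, n_predicted, n_actual),
    i.e. correctly assigned papers, papers predicted to the author, and the
    author's papers that were to be assigned."""
    total = sum(actual for _, _, actual in per_author.values())
    score = 0.0
    for correct, predicted, actual in per_author.values():
        prec = correct / predicted if predicted else 0.0
        rec = correct / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (actual / total) * f1  # weight by the author's paper share
    return score
```

An author with many papers to assign thus contributes proportionally more to the final score than an author with a single paper.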

3.1.3 Incorrect Assignment Detection.
As the inevitable cumulative errors introduced by SND and RND methods greatly affect the efficacy of subsequent assignments, incorrect assignment detection is a vital task to detect and remove wrongly-assigned papers.

Problem 3. Incorrect Assignment Detection (IND). Given a conflated author entity a whose papers actually belong to several distinct authors {a_1, . . . , a_k}, IND aims at detecting the papers incorrectly assigned to a. Assuming a_1 covers the highest percentage of papers within a, we set a* = a_1. Consequently, the papers owned by {a_2, . . . , a_k} are defined as the incorrectly-assigned papers to be detected.
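The labeling rule above can be sketched in a few lines; the input format (a mapping from each paper under the conflated profile to its true author) is an assumption of ours for illustration:

```python
from collections import Counter

def incorrect_assignments(true_author_of_paper):
    """Given the true author of each paper under one conflated profile,
    treat the majority author as the profile owner a* and return the
    papers that are incorrectly assigned."""
    counts = Counter(true_author_of_paper.values())
    owner, _ = counts.most_common(1)[0]
    return {p for p, a in true_author_of_paper.items() if a != owner}
```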
Evaluation Protocol. We leverage the Area Under the ROC Curve (AUC), broadly adopted in anomaly detection [22], and Mean Average Precision (MAP), which pays more attention to the rankings of the incorrect cases, as the evaluation metrics.
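Both metrics can be sketched in plain Python, scoring how well a detector ranks incorrectly-assigned papers (label 1) above correct ones (label 0). The rank-based AUC formulation and the single-ranking MAP below are standard definitions; the leaderboard's exact implementation may differ:

```python
def auc(labels, scores):
    """AUC via the rank formulation: the probability that a positive
    case outranks a negative one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_average_precision(labels, scores):
    """Average precision over one ranking: mean of precision@rank
    at each position where an incorrect (positive) case appears."""
    order = sorted(zip(scores, labels), key=lambda x: -x[0])
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(order, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank
    return ap / hits if hits else 0.0
```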
3.1.4 Discussion. The proposed three name disambiguation tasks shed light on the life cycle of name disambiguation problems in online digital libraries. Specifically, the SND task reflects the requirements of building digital libraries at an early stage; the RND task corresponds to the urgent needs of current online platforms; and the IND task is devoted to correcting the accumulated errors of name disambiguation algorithms, which is critical to maintaining the reliability of the name disambiguation system. In addition, the three tasks can serve as the backbone of other, more complex name disambiguation tasks. We believe name disambiguation methods that perform better on these tasks are powerful enough to handle the majority of name disambiguation situations. Although Zhang and Tang [44] have already proposed similar types of tasks, we improve on them by 1) taking the NIL issue into account and formalizing the RND problem as a more general classification problem instead of a ranking problem, 2) standardizing the evaluation protocols of the three tasks, and 3) arranging contests for the first two tasks to promote their accomplishment.

Historical Contests & Regular Leaderboard
From 2019 to 2022, WhoIsWho periodically released three versions of benchmarks. To promote the development of the community, we sponsored three rounds of name disambiguation contests on BienData. The timeline of the released benchmarks and corresponding contests is depicted in Figure 5. To date, more than 3,000 people around the world have downloaded the WhoIsWho benchmark more than 10,000 times. WhoIsWho has already become one of the most well-known and representative benchmarks in the name disambiguation community. In addition, to assist researchers who are interested in solving name disambiguation problems at any time, we maintain a regular leaderboard with a contest based on the most recent benchmarks released by WhoIsWho.
In the following part, we briefly revisit the methodologies proposed by the contest winners, based on which we conduct an in-depth empirical analysis to probe the key factors that may have a significant impact on the performance of name disambiguation methods.

Methodologies of the Contest Winners.
We revisit the approaches of the contest winners for the first two tasks, SND and RND, since they have the best performance to date. How to measure the fine-grained similarities between papers and authors is vital to solving both tasks, and these similarities rest on the interactions between authors and papers, which therefore need to be explored first. In the following part, we skip over some technical details and focus on the strategies for quantifying the connections between papers and authors.
From-scratch Name Disambiguation. The SND task aims to group the papers written by the same author. The contest winner divides the similarities across papers into two categories.
Semantic Aspect. The contest winner views the paper's title, venue, organizations of authors, year, and keywords as the semantic features, based on which they measure the topical similarities between papers. Specifically, they first learn word2vec [23] embeddings based on the semantic features of all the papers in the WhoIsWho benchmark. Then they project the semantic features of a paper onto the corresponding word embeddings and average them as the paper embedding. Finally, they calculate soft semantic similarities between papers based on these semantic embeddings.
Relational Aspect. The contest winner takes author names and organizations as the relational features of papers. For example, the co-occurrence of the same author name in two papers reflects their relationship. Specifically, they construct a relational graph by considering papers as nodes and the connections between papers as edges. If two papers have identical coauthor names, a co-author edge is added. When two papers have the same organization for the concerned author, a co-organization edge is added. After that, they employ metapath2vec [6] to obtain relational embeddings of the papers. Finally, they calculate relational similarity scores between papers based on these relational embeddings.
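The edge-construction step of the relational graph can be sketched as follows; the input schema (papers as dicts with "authors" and "orgs" fields) is an assumption of ours, and in the winning method the resulting graph is then fed to metapath2vec rather than used directly:

```python
def build_relational_edges(papers):
    """Add a CoAuthor edge when two papers share an author name,
    and a CoOrg edge when they share an organization."""
    edges = []
    for i in range(len(papers)):
        for j in range(i + 1, len(papers)):
            if set(papers[i]["authors"]) & set(papers[j]["authors"]):
                edges.append((i, j, "CoAuthor"))
            if set(papers[i]["orgs"]) & set(papers[j]["orgs"]):
                edges.append((i, j, "CoOrg"))
    return edges
```

For clarity this sketch compares raw strings; in practice, author names and organizations would first be normalized (e.g., via the name-blocking rules discussed in Section 2).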
Furthermore, the contest winner combines the two multi-modal similarities to estimate the final similarities between papers and then uses DBSCAN [7] to obtain the clustering results.
Real-time Name Disambiguation. The RND task focuses on measuring the connections between a paper and the collection of papers of each candidate author. The contest winner captures more precise semantic features between unassigned papers and candidate authors than in the SND task, as follows.
Semantic Aspect. Besides the soft semantic features, i.e., those measured via embedding techniques, they also consider ad-hoc semantic features, i.e., those measured via hand-crafted features. In terms of the soft semantic features, they compute similarities between the target paper and each paper of the candidate author, just as in SND. Then they adopt aggregation functions to obtain the overall similarities between the target paper and all papers of the candidate author. As for the ad-hoc semantic features, they propose 36-dimensional hand-crafted features to explicitly capture the semantic correlations between the target paper and the candidate author. The complete features are listed in Table 5. Finally, they concatenate the soft semantic features and the ad-hoc semantic features to create the final similarity features. Then they adopt ensemble methods to acquire the classification results.
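The aggregation step for the soft semantic features can be sketched as below. The cosine similarity over averaged embeddings follows the text; the particular aggregation functions (max/mean/min) are our illustrative choices, not necessarily those of the winning entry:

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def author_similarity_features(target_emb, author_paper_embs):
    """Aggregate paper-level similarities between the target paper and
    a candidate author's papers into fixed-size features."""
    sims = [cosine(target_emb, e) for e in author_paper_embs]
    return {"max": max(sims), "mean": sum(sims) / len(sims), "min": min(sims)}
```

These fixed-size aggregates can then be concatenated with the hand-crafted features and fed to an ensemble classifier, as described above.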
Being aware that the contest winner's methods disregard the characterization of relational properties, we make the following hypotheses: 1) Unlike the SND task, which only requires building a relational graph over the papers of one name once, the RND task needs to build time-consuming graphs between each unassigned paper and its corresponding candidate authors. 2) Some ad-hoc features can partially capture relational correlations. For example, the coauthor-occurrence feature, which counts the number of shared coauthors between the target paper and a candidate author, can be viewed as the coauthor edge weight on a virtual paper-author graph. Nevertheless, how to model relational correlations in the RND task is still under-explored.
Incorrect Assignment Detection. The IND task targets detecting accumulated incorrect papers, which is important to guarantee the reliability of academic systems. However, there is no available IND benchmark at the current stage. To this end, we have released the V3.1 data, consisting of 1,000+ authors and 200,000+ papers, dedicated to the IND task. To the best of our knowledge, we are the first to specify the IND task and release a corresponding benchmark. Furthermore, we are planning a contest for the IND task based on the released WhoIsWho-v3.1 benchmark in the coming months.

Discussion
In summary, we observe a crucial insight: a good approach for comprehensively measuring the correlations among papers should intertwine multi-modal features, i.e., semantic and relational features. The contest results show that methods capturing both aspects of features produce impressive results. Although the contest for the third task, IND, has not been held, we assume a similar conclusion may be drawn for the IND task, as it also depends on evaluating the agreement among papers.

EMPIRICAL FACTOR ANALYSIS
We conduct in-depth ablation studies to understand the effect of various factors on name disambiguation performance. To ensure fair comparisons, we only modify the factors of interest, leaving the others unaltered. We adopt the metrics defined in the WhoIsWho tasks for evaluation. For each experiment, we run 5 trials and report the mean results on the WhoIsWho-v3 validation set.

Semantic Feature Importance
We study the effects of accessible paper attributes, i.e., title (T), keywords (K), abstract (A), venue/journal (V), year (Y), author names (N), and organizations of authors (O), on the SND and RND tasks.
From-scratch Name Disambiguation. To perform the soft semantic feature analysis, we adopt a similar implementation pipeline to the contest winner's method while exploring different attributes.
Results. The results are shown in Fig. 6(a). The fields of title, keywords, author name, and organization have a more significant effect on disambiguation than the others. The abstract field contains many redundant words and much noise. The venue and year also fail to capture the similarities among papers. (1) Combining consistent attributes may better express semantic correlations. We combine the four effective single features, as shown by the yellow bars. Title + name even performs worse than its constituent single attributes. We speculate that, compared to title, keywords, and organization, which have semantic correlations among papers, the author name has more linguistic qualities. Thus, combining two disparate attributes results in performance degradation. The performance of title improves when it is paired with keywords or organization, suggesting that a consistent attribute combination may better express semantic correlations.
(2) Combining title, keywords, and organization performs the best. Finally, the combination of title, keywords, and organization, represented by the purple bars, performs better than mixing all the attributes together, i.e., the blue bar. This suggests that adding more attributes without calibration may introduce noise and lower performance.
Real-time Name Disambiguation. We also adopt the RND contest winner's implementation pipeline. In addition to the soft semantic feature analysis, we explore how various paper attributes affect the performance of name disambiguation methods using the hand-crafted features listed in Table 5.
Results.The results are shown in Figure 6(c) and 6(d).
(1) The soft semantic features share a similar trend on both tasks.
Regarding the soft semantic features, Figures 6(c) and 6(a) show that both tasks share a common trend: 1) the attributes of title, keywords, and organization perform well, and 2) the combination of title, keywords, and organization performs better than simply mixing all the considered features. This is expected because both tasks measure the agreement between papers and authors via the same soft semantic feature modality. In terms of the ad-hoc semantic features, shown in Figure 6(d), the author name is the most effective factor in determining the performance of the algorithms.
(2) Mixing all attributes performs best. Surprisingly, the blue bar, which represents the performance of combining all features, outperforms the other combination patterns, suggesting that, despite falling into the semantic feature category, the ad-hoc feature characterization framework has different underlying biases from the soft one.

Relational Feature Importance
Empirically, the fields of author name and venue exhibit greater relational dependency between papers. Moreover, the organization field has both relational and semantic characteristics. Therefore, we build three types of relational edges between papers: CoAuthor, where two papers are related only if they share the same author name; CoOrg, where two papers are related only if they share the same affiliation; and CoVenue, where two papers are related only if they are published in the same venue or journal.
From-scratch Name Disambiguation. We also follow the implementation pipeline of the contest winner's method to obtain the relational paper embeddings on the built relational graphs, while exploring the effects of the different relational edges.
Results. Fig. 6(b) presents the performance of using different relation types. The grey bars, which show that CoAuthor performs the best among the single relational types, suggest that the author name carries more important relational information than semantic information. CoVenue performs the worst, because massive numbers of papers from various domains may be published in the same venue/journal. Combining all three relation types yields the best results among the mixed variants, represented by the yellow and purple bars, which is consistent with the empirical findings in Section 4.1 that consistent attribute combinations can improve performance.

Feature Modality Importance
We explore how the semantic and relational features affect the effectiveness of disambiguation. We conduct a thorough examination of the combination patterns of multi-modal features to see which ones perform best. For the SND task, we leverage the paper attributes of title, keywords, and organization as the soft semantic features. For the relational features, we adopt three relation types, i.e., CoAuthor, CoOrg, and CoVenue. For the RND task, in addition to the soft and ad-hoc semantic features used in Section 4.1, we build a heterogeneous ego-graph for each pair of target paper and candidate author in order to add relational features. Results. Table 2 shows the performance of single feature modalities and their combinations. (1) Mixing multi-modal features performs best. We observe that a single modality, i.e., semantic or relational features alone, underperforms their combinations, i.e., SND-all and RND-all, indicating that the semantic and relational features are complementary to one another. However, for the RND task, the ad-hoc semantic features alone can compete with their combinations, and the relational features bring only marginal improvements. That explains why the best contest approach for this task does not take advantage of relational features. Therefore, how to effectively incorporate relational features remains an open question.
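To make the ego-graph construction concrete, the following sketch builds a tiny heterogeneous ego-graph for one (target paper, candidate author) pair, adding one edge per attribute value shared with the target paper. The function, field names, and toy records are our illustration, not the paper's exact implementation:

```python
def build_ego_graph(target, candidate_papers):
    """Sketch of a heterogeneous ego-graph: the target paper plus the
    candidate author's papers, with typed edges for shared attributes."""
    nodes = ["target"] + [p["id"] for p in candidate_papers]
    edges = []
    for p in candidate_papers:
        for field in ("authors", "org", "venue"):
            # add a typed edge when any value of this field is shared
            if set(target.get(field, [])) & set(p.get(field, [])):
                edges.append(("target", p["id"], field))
    return nodes, edges

target = {"authors": ["jing zhang"], "org": ["tsinghua"], "venue": ["kdd"]}
candidate = [
    {"id": "p1", "authors": ["jing zhang", "wei li"], "org": ["pku"], "venue": ["kdd"]},
    {"id": "p2", "authors": ["hao chen"], "org": ["mit"], "venue": ["nips"]},
]
nodes, edges = build_ego_graph(target, candidate)
```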

Overall Evaluation
In this section, we compare the proposed SND-all and RND-all frameworks with existing state-of-the-art name disambiguation methods. Real-time Name Disambiguation. We adopt the following baselines. IUAD [18] performs the RND task by reconstructing the collaboration network between newly-arrived papers and existing authors. CONNA [3] is an interaction-based model: basic interactions are built between the token embeddings of two attributes, the attribute matrices are then aggregated into paper-level interactions, and finally the paper-level matrices are aggregated into author-level interactions. CONNA+Ad-hoc is a combination methodology that incorporates hand-crafted features into the CONNA framework introduced in [3]; for fair comparisons, we leverage the features listed in Table 5. RND-all is our proposed method based on the findings in Section 4.3. It adopts the soft and ad-hoc semantic features used in Section 4.1, builds heterogeneous ego-graphs as relational features, and combines the two feature modalities to make predictions. Other prevailing methods, such as Louppe et al. [20], Zhang et al. [41], and Camel [43], have been empirically shown to be less powerful than the adopted baselines and are thus omitted from the experiments.
We only consider baselines with released code. Results. Table 3 and Table 4 report the performance of various name disambiguation methods on the two tasks. The proposed SND-all, RND-all, and the contest winner significantly outperform the other baselines by 25.74∼28.10% pairwise-F1 and 2.35∼11.80% weighted-F1, respectively. The significant performance gap between our proposed methods and baselines from recent research shows that the capability of prevailing name disambiguation methods is still far from satisfactory, which also reflects the significance of the WhoIsWho benchmark. Moreover, our simple yet effective methods slightly outperform the contest winner, suggesting that our empirical factor analysis successfully captures the essential components that enhance the effectiveness of name disambiguation methods.
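For reference, pairwise-F1 treats every pair of papers placed in the same cluster as a positive prediction. A minimal implementation of this standard clustering metric:

```python
from itertools import combinations

def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1 over all paper pairs: a pair is positive when
    both papers fall in the same cluster."""
    n = len(true_labels)
    tp = pred_pos = true_pos = 0
    for i, j in combinations(range(n), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        true_pos += same_true
        pred_pos += same_pred
        tp += same_true and same_pred
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / true_pos if true_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A perfect clustering scores 1.0 regardless of the cluster IDs used, since only co-membership of pairs matters.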

Performance in Realistic Cases
Papers in the WhoIsWho benchmark usually contain rich information, since annotators prefer to work on papers with abundant attributes that provide helpful evidence to support their decisions. Unfortunately, online digital libraries contain many papers with sparse attributes, i.e., papers in which multiple attributes are absent. Taking AMiner as an example, almost half of the newly-arrived papers lack the organization attribute. To understand online name disambiguation on such papers, we evaluate SND-all and RND-all on these sparse-attribute cases.
Results. The results are shown in Figure 7. Among these, papers without author names perform worst, dropping 36.72% pairwise-F1 and 15.58% weighted-F1. The absence of the organization or keyword attributes also significantly degrades the online performance of name disambiguation algorithms on both tasks, by 8.97∼10.51% pairwise-F1 and 4.28∼6.40% weighted-F1. The results indicate that the online name disambiguation scenario is even more sophisticated than what WhoIsWho reflects. In the future, we will release datasets with sparse attributes to encourage work on more realistic online name disambiguation scenarios.

WHOISWHO TOOLKIT
By automating the data loading, feature creation, model construction, and evaluation processes, the WhoIsWho toolkit lets researchers easily develop new name disambiguation approaches. An overview of the toolkit pipeline is illustrated in Figure 9. The toolkit is fully compatible with PyTorch and its associated deep learning libraries, such as Hugging Face [39]. Additionally, the toolkit offers library-agnostic dataset objects that can be used by other Python deep learning frameworks such as TensorFlow [1]. To keep things simple, we concentrate on building a basic RND method using PyTorch, shown in Listing 1. For more details, refer to https://github.com/THUDM/WhoIsWho. Disambiguating arXiv Papers. We deploy the RND-all method, implemented with our toolkit, on AMiner to disambiguate daily papers from arXiv.org on the fly. A demo page is depicted in Figure 8; details are given in Section A.8. We manually checked the latest 100 disambiguation results and found that 90% of the assignments are accurate.
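As a rough, hypothetical illustration of the pipeline stages in Figure 9 (load, featurize, train, evaluate) — the class and callables below are ours, not the toolkit's actual API; Listing 1 shows the real usage:

```python
# Hypothetical skeleton mirroring the pipeline stages of Figure 9;
# none of these names belong to the toolkit's real API.
class RNDPipeline:
    def __init__(self, loader, featurizer, model, evaluator):
        self.loader = loader          # (b) dataset loading/splitting
        self.featurizer = featurizer  # (c) feature creation
        self.model = model            # (d) training & prediction
        self.evaluator = evaluator    # (f) task-dependent evaluation

    def run(self):
        train, valid = self.loader()
        self.model.fit(self.featurizer(train))
        return self.evaluator(self.model, self.featurizer(valid))

class DummyModel:
    def fit(self, features):
        self.n_train = len(features)  # stand-in for real training

pipe = RNDPipeline(
    loader=lambda: (["p1", "p2", "p3"], ["p4"]),
    featurizer=lambda papers: [len(p) for p in papers],
    model=DummyModel(),
    evaluator=lambda model, feats: {"n_valid": len(feats)},
)
result = pipe.run()
```

Swapping in a custom featurizer or model, as step (e) of Figure 9 suggests, only requires replacing the corresponding callable.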

RELATED WORK
Here, we review the prevailing name disambiguation datasets and state-of-the-art name disambiguation algorithms. Name Disambiguation Datasets. The size of datasets heavily influences the performance of name disambiguation algorithms. To address the problem, the community has recently created a large number of name disambiguation datasets. Among them, several efforts directly harvest datasets from existing digital libraries, including PubMed [40,45], DBLP [14], etc. [17,37,46]. However, assignment mistakes, as shown in Figure 2, hamper the development of effective algorithms [4,44]. Others manually label a small amount of data on top of noisy data from existing databases to reduce data noise [11,13,20,24,26,30,32,34,38,47]. Most of them, however, do not have sufficient instances, as shown in Figure 1; detailed data statistics are given in Table 6. Some have restricted scopes; for example, SCAD-zbMATH [24] is customized for the mathematical domain. The resulting fragile inductive bias affects the performance and generalization of name disambiguation methods trained on these datasets. Subramanian et al. [31] build a unified dataset by aggregating several small-scale datasets; however, the quality of the constituents has not been checked. Practical Tasks & Algorithms. Most efforts focus on the SND task. Generally, they operate in three steps: blocking, paper similarity matching, and clustering. Backes [2] discusses the name-blocking step. Several works emphasize the paper similarity matching and clustering steps. Early attempts designed hand-crafted similarity metrics [5] to measure paper similarities. Later, researchers discovered that constructing paper similarity graphs excels at learning high-order similarities [6,8,13,27,47]. As for the clustering step, methods such as hierarchical agglomerative clustering and DBSCAN are adopted; DBSCAN is preferred by practitioners as there is no need to specify the number of clusters.
The RND task, which aims to assign newly-arrived papers to existing authors, is a more practicable scenario for online academic systems. Besides the adopted baselines, Qian et al. [26] predict the likelihood of a paper being written by a specific author via the co-author and keyword attributes. Pooja et al. [25] utilize dynamic graph embedding to model evolving graphs. Several works [18,42] further employ probabilistic models for online paper assignment.
Inevitable cumulative errors greatly affect the efficacy of name disambiguation algorithms. Thus, the IND task is vital to guarantee the reliability of academic systems. Unfortunately, this issue has not received much attention [4].
Previous methods are usually evaluated on diverse small-scale datasets, which hampers the development of the community. Thus, a large-scale benchmark, a regular leaderboard with comprehensive tasks, and an easy-to-use toolkit for web-scale academic name disambiguation are needed.

CONCLUSIONS
This paper delivers WhoIsWho, including a benchmark, a leaderboard, and a toolkit for web-scale academic name disambiguation. Specifically, the large-scale benchmark with high ambiguity enables the design of robust algorithms. Sponsored contests with two tracks promote advances in the name disambiguation community. A regular leaderboard is publicly available to keep track of recent advances, and an easy-to-use toolkit allows end users to rapidly build their own algorithms and publish their results on the leaderboard. In summary, WhoIsWho is an ongoing, community-driven, open-source project, and we encourage contributions from the community.

A.2 Data Organizations in the WhoIsWho Benchmark
To date, WhoIsWho has released three versions of datasets, i.e., WhoIsWho-v1, -v2, and -v3, with one specialized dataset, WhoIsWho-v3.1, for the IND task. Among them, the v1, v2, and v3 datasets share the same organization for the SND and RND tasks. Here, we briefly review the data organization.
A.2.1 WhoIsWho-v1/v2/v3. The datasets are organized as a two-level dictionary, i.e., names-authors-papers, as shown in Listing 2. The key of the first-level dictionary is the author name, and the value is the set of author profiles sharing that "same name". The term "same name" refers to names unified by name-blocking techniques [2,14], such as moving the last name to the front or preserving all name parts as initials except the last name. For example, the variants of "Jing Zhang" include "Zhang Jing", "J. Zhang", and "Z. Jing". The author profiles are in turn organized as a dictionary whose keys are author IDs and whose values are the paper IDs of each author.
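The two-level layout can be pictured as follows; all IDs below are made up for illustration, and Listing 2 shows a real sample:

```python
# Illustrative shape of the WhoIsWho-v1/v2/v3 dictionaries.
whoiswho = {
    "jing_zhang": {                               # level 1: blocked author name
        "author_id_001": ["paper_a", "paper_b"],  # level 2: author -> paper IDs
        "author_id_002": ["paper_c"],
    },
}

# Name blocking collapses variants onto one first-level key, e.g.
# "Jing Zhang", "Zhang Jing", "J. Zhang", "Z. Jing" -> "jing_zhang".
authors = whoiswho["jing_zhang"]
```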
For each paper, we collect the title, author names, organizations of all the authors, keywords, abstract, publication year, and venue (conference or journal) as its attributes. A toy example of the paper with ID "9PgiwDo7" is shown in Listing 4.

A.2.2 WhoIsWho-v3.1 for the IND task. WhoIsWho-v3.1 is organized as a one-level dictionary, as the IND task aims to detect and remove erroneous papers within each author profile. The key of the dictionary is the author ID, and the value contains the papers belonging to the author, i.e., the normal data, together with the manually detected error papers, i.e., the outliers. A demo case is presented in Listing 3.
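The per-paper record and the one-level v3.1 layout can be sketched as follows; all field values and the "normal_data"/"outliers" key names are placeholders of ours, and Listings 3 and 4 show the real samples:

```python
# Placeholder paper record carrying the attributes listed above.
paper = {
    "id": "9PgiwDo7",
    "title": "A placeholder title",
    "authors": [{"name": "Jing Zhang", "org": "Some University"}],
    "keywords": ["name disambiguation"],
    "abstract": "A placeholder abstract.",
    "year": 2020,
    "venue": "Some Conference",
}

# One-level layout of WhoIsWho-v3.1 for the IND task: each author
# maps to normal papers plus manually detected outlier papers.
ind_data = {
    "author_id_001": {
        "normal_data": ["paper_a", "paper_b"],
        "outliers": ["paper_x"],
    },
}
```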

A.3 Data Split of WhoIsWho Contest
We describe the process of splitting the WhoIsWho datasets into training, validation, and test sets for the contests of the three tasks, i.e., SND, RND, and IND. From-scratch Name Disambiguation. The SND task targets partitioning the papers under an author name into different groups, where each group contains papers from the same author and papers in different groups belong to different authors. Thus, we first split the datasets into training, validation, and test sets at the level of author names following specific ratios. Then, for the validation and test sets, we delete the authorship links between authors and papers within each name, as shown in Listing 5. Researchers should correctly cluster papers belonging to the same author into the same group.
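A name-level split, ensuring that no author name leaks across sets, might look like the following sketch; the ratios and helper function are illustrative, not the contest's exact implementation:

```python
import random

def split_by_name(name_dict, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split at the author-name level so every paper under one
    blocked name lands in exactly one of train/valid/test."""
    names = sorted(name_dict)
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * ratios[0])
    n_valid = int(len(names) * ratios[1])
    pick = lambda keys: {k: name_dict[k] for k in keys}
    return (pick(names[:n_train]),
            pick(names[n_train:n_train + n_valid]),
            pick(names[n_train + n_valid:]))

demo = {f"name_{i}": {} for i in range(10)}
train, valid, test = split_by_name(demo)
```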
Real-time Name Disambiguation. The RND task aims at assigning newly-arrived papers to existing authors. Thus, we first split the datasets into training, validation, and test sets at the level of author names following specific ratios. Then, for the validation and test sets, we sort the papers within each author by publication year in ascending order. To simulate the real RND scenario, we treat the latest papers as the newly-arrived unassigned papers and the remainder as the existing author profile, as shown in Listing 6. We also add several NIL papers, i.e., papers that cannot be assigned to any existing author profile, to the unassigned papers. Researchers must not only assign papers to the right authors but also distinguish the NIL papers. Incorrect Assignment Detection. The IND task is designed for detecting and removing erroneous papers within each author profile. Concretely, we construct the dataset of the IND task as follows: 1) as illustrated in Table 1, the overall data annotation pipeline includes a 'Clean' step, during which annotators remove or split obviously incorrect papers from the concerned author; the subsequent 'Validate' step allows annotators to perform the same 'Clean' operation on incorrectly assigned papers that are more difficult to identify. These two stages provide a sufficient number of incorrectly assigned papers to be detected in the IND task. 2) Some authors manually maintain their profiles, e.g., adding new papers or removing papers that do not belong to them; we also collect the removed papers as detection targets for the IND task. We then split the training, validation, and test sets by author groups.

Table 5: The detailed definitions of the 36-dimensional hand-crafted features.

A.7.1 Baselines of the SND task. G/L-emb is a method for constructing paper-paper graphs using co-authorship connections. The process begins by generating initial paper embeddings through a weighted average of Word2Vec embeddings for all tokens
within a paper. Next, the method fine-tunes these embeddings by first learning on a global paper-paper network and then adapting them to a local paper-paper network, specific to each author, using graph auto-encoding. Lastly, hierarchical agglomerative clustering (HAC) is employed to segregate these papers into distinct groups. LAND is a method for constructing heterogeneous scholarly knowledge graphs (KGs) that encompass multiple entities, including papers, authors, venues, affiliations, and more. These KGs also feature various relations, such as co-authorship and publication venues. Entity input embeddings are initialized using BERT models, and KG embedding techniques are then applied to derive paper and author embeddings. Finally, hierarchical agglomerative clustering (HAC) is utilized to group these entities based on the embeddings. IUAD is a method that constructs collaboration networks by treating papers as nodes and creating edges between two papers if they share the same author. It employs probabilistic generative methods to determine whether two papers in the collaboration network belong to a single author, with the goal of accurately reconstructing the complete collaboration network. Once the probabilistic models are trained, they are utilized to perform the SND task. SND-all is our proposed baseline method based on empirical studies of contest-winning approaches. It starts by estimating semantic correlations among the papers to be disambiguated, using title, keywords, and organizations as soft semantic features.
For each paper, SND-all projects these features into word embeddings via Word2Vec and averages them to create paper embeddings. Cosine similarities are then calculated between papers based on these semantic embeddings. Additionally, SND-all constructs a heterogeneous network with three relational edges among papers: co-author, co-organization, and co-venue. The method employs metapath2vec to generate relational embeddings of papers and computes relational similarity scores based on these embeddings. Lastly, SND-all combines the multi-modal similarities to calculate overall similarities among papers, and DBSCAN is used to derive the final clustering results based on these similarities. Contest Winner Method follows a similar pipeline to SND-all. However, it uses all paper attributes to estimate semantic correlations among papers, which has proven less effective than utilizing a few informative attributes, such as title, keywords, and organizations, as implemented in SND-all. Furthermore, when estimating relational similarities among papers, the Contest
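As a lightweight sketch of the semantic half of this pipeline, the code below averages toy word vectors (standing in for trained Word2Vec embeddings) into paper embeddings, computes cosine similarities, and clusters with single-link thresholding as a crude stand-in for DBSCAN; everything here is our illustration, not the SND-all implementation:

```python
import math

def avg_embedding(tokens, w2v):
    """Paper embedding = average of its token vectors (soft semantic)."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    dim = len(next(iter(w2v.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def threshold_cluster(sim, eps=0.5):
    """Single-link clustering over pairs with similarity >= eps
    (a crude stand-in for DBSCAN on a precomputed matrix)."""
    n = len(sim)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Toy word vectors: two "graph" papers and one "biology" paper.
w2v = {"graph": [1.0, 0.0], "network": [0.9, 0.1],
       "biology": [0.0, 1.0], "gene": [0.1, 0.9]}
papers = [["graph", "network"], ["graph"], ["biology", "gene"]]
embs = [avg_embedding(p, w2v) for p in papers]
sim = [[cosine(a, b) for b in embs] for a in embs]
labels = threshold_cluster(sim)
```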


Figure 2: Illustration of the challenges of annotating authors with the name "Yang Yang". Paper P5 is incorrectly assigned because of coauthorship with the same third person. Two authors are mistakenly separated due to an organization shift.

Figure 3(a) illustrates the distribution of paper publication dates.

Definition 1. Paper. A paper p is associated with multiple fields of attributes, i.e., p = {a_1, ..., a_F}, where a_f ∈ p represents the f-th attribute and F is the number of attributes.
Definition 2. Author. An author a is comprised of a set of papers, i.e., a = {p_1, ..., p_N}, where each paper p_i = {a_1, ..., a_F} and N is the number of papers authored by a.
Definition 3. Candidate Papers. Given a person name denoted by n, P^n = {p^n_1, ..., p^n_M} is the set of candidate papers written by any author with the name n.

Figure 5: The release times of the WhoIsWho benchmark versions and launched contests.

Figure 6: Feature importance on the SND and RND tasks.

Figure 8: A demo of disambiguating daily papers from arXiv.org.

Figure 9: Overview of the WhoIsWho toolkit pipeline. (a) WhoIsWho provides a large-scale benchmark with high ambiguity and large quantity. (b) The toolkit automates dataset processing and splitting: the data loader automatically loads arbitrary versions of the datasets and further splits them in a standardized manner. (c) The toolkit provides flexible modules for feature creation, including semantic feature characterization and relational graph construction, based on which (d) researchers can adopt models pre-defined in the toolkit library for training and prediction. Moreover, (e) researchers can build their own feature processing pipelines and develop ML models. (f) The toolkit evaluates models in a task-dependent manner and outputs their performance on the validation set. Finally, (g) WhoIsWho provides public leaderboards to keep track of recent advances.

Figure 9 demonstrates the overview of the WhoIsWho toolkit pipeline. A toy example of building a basic RND algorithm is shown in Listing 1.
(a) Ego graph of the author. (b) Ego graph of the paper.

Figure 11: The ego graphs built on the target paper and the candidate author in the RND-all method.

Figure 12: A toy annotation example for annotating authors with the name "Andrea Rossi".

Table 1: Data annotation pipeline. For operations performed by three annotators, majority voting is applied to resolve conflicts. Anno. is the abbreviation of annotators.

Table 2: Performance (%) of different feature modalities (semantic or relational) and their combinations.