Overlapping and Robust Edge-Colored Clustering in Hypergraphs

A recent trend in data mining has explored (hyper)graph clustering algorithms for data with categorical relationship types. Such algorithms have applications in the analysis of social, co-authorship, and protein interaction networks, to name a few. Many such applications naturally have some overlap between clusters, a nuance which is missing from current combinatorial models. Additionally, existing models lack a mechanism for handling noise in datasets. We address these concerns by generalizing Edge-Colored Clustering, a recent framework for categorical clustering of hypergraphs. Our generalizations allow for a budgeted number of either (a) overlapping cluster assignments or (b) node deletions. For each new model we present a greedy algorithm which approximately minimizes an edge mistake objective, as well as bicriteria approximations where the second approximation factor is on the budget. Additionally, we address the parameterized complexity of each problem, providing FPT algorithms and hardness results.


INTRODUCTION
Graph clustering is a fundamental problem in data mining, relevant whenever the dataset captures relationships between entities. A recent trend has focused on clustering hypergraphs [20-22, 24, 31-34], which better capture multiway relationships such as group social interactions, group email correspondence, and academic co-authorship. A second trend has emphasized clustering on edge-colored (hyper)graphs, where the color of a (hyper)edge indicates an interaction of a certain type or category [5-7, 11, 28, 37, 38].
Given a (hyper)graph with edge categories (modeled by colors), categorical clustering seeks to cluster nodes in such a way that node clusters tend to match with edge categories. This type of clustering is natural in many different settings. For example, a hyperedge can be used to represent a group of researchers who co-author a paper together, and edge colors can denote paper discipline or venue (e.g., WSDM or SIGCOMM) [5, 12]. In this context, categorical clustering provides a way to infer authors' research areas based on publication history. Edge-colored hypergraphs can also be used to represent sets of ingredients (nodes) in the same food recipe (hyperedge), with edge color indicating cuisine type (e.g., Korean food or French cuisine) [5, 28, 38]. Categorical clustering then provides a way to identify groups of ingredients that are frequently used together in the same cuisine type. Edge colors can also represent discrete time windows in temporal networks, in which case categorical clustering can be used for temporal clustering, e.g., to identify groups of users in an online social network that are active in the same time period [5]. Existing tools for categorical clustering have also been used to cluster biological networks (where colors indicate gene interaction types) [12], social networks such as Facebook and Twitter (where colors indicate relationship types) [28, 38], and online vacation rentals based on user browsing history (where hyperedges are groups of rentals browsed during the same user session and colors indicate the location of the user) [37].
This paper addresses two limitations of previous research. First, existing categorical clustering models do not allow for overlap between clusters. However, cluster overlap is usually necessary to accurately model the application in question. Researchers publish in multiple fields and often are identified with multiple areas of expertise. Many food ingredients are common across a wide range of different cuisines. Users on social media can be very active in more than one discrete time window. The second limitation is that existing techniques do not allow for any notion of robustness to noise among the nodes. This may be necessary to achieve accurate clusterings either in applications where the data itself is noisy, or in applications where the nature of the network under consideration makes it impossible to accurately restrict the cluster memberships of certain nodes. For example, a reasonable clustering of ingredients by cuisine should not restrict nearly universal ingredients such as salt and flour to one or even a few clusters. Our goal is to provide the first clustering algorithms to simultaneously (i) respect high-order relationships (hyperedges), (ii) provide a categorical clustering according to discrete edge labels, and (iii) contain a mechanism for allowing overlapping clusters or robustness to noisy data points.
Modeling overlap and robustness. We accomplish this goal through the lens of Edge-Colored Clustering (ECC), a recent categorical clustering framework [9]. Given an edge-colored hypergraph, ECC asks for assignments of colors to nodes such that, as much as possible, edges contain nodes which match their color. Formally, a solution makes a mistake at edge $e$ of color $c$ if any node in $e$ is not assigned color $c$. The goal is to produce an assignment of colors to nodes which minimizes mistakes. We define three new generalizations of this framework. These all have the same objective function as ECC (minimize edge mistakes), but each comes with a parameter $b$ that defines a budget for different ways of assigning additional colors to nodes.
• Locally-Budgeted Overlapping-ECC (Local ECC) seeks to minimize the number of edge mistakes, while assigning up to $b \ge 1$ colors to each node.
• Globally-Budgeted Overlapping-ECC (Global ECC) allows one "free" color assignment per node, and additionally has $b \ge 0$ extra node-color assignments that are shared among all nodes.
• Robust ECC also allows one "free" color assignment per node, and then up to $b \ge 0$ nodes are given every color. Equivalently, up to $b$ nodes are deleted and hyperedges are contracted to contain only non-deleted nodes.
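The three budget rules can be made concrete with a small sketch. The representation below (a dictionary from nodes to sets of colors, with colors numbered 1 through k) and the function names are assumptions made for illustration; they are not taken from the authors' code.

```python
from typing import Dict, Hashable, Set

Coloring = Dict[Hashable, Set[int]]  # node -> set of assigned colors


def is_local_feasible(coloring: Coloring, b: int) -> bool:
    """Local ECC: every node receives at most b colors."""
    return all(len(colors) <= b for colors in coloring.values())


def is_global_feasible(coloring: Coloring, b: int) -> bool:
    """Global ECC: one free color per node, at most b extra assignments overall."""
    extra = sum(max(len(colors) - 1, 0) for colors in coloring.values())
    return extra <= b


def is_robust_feasible(coloring: Coloring, b: int, k: int) -> bool:
    """Robust ECC: each node has at most one color, except at most b
    'deleted' nodes that hold all k colors."""
    all_colors = set(range(1, k + 1))
    deleted = 0
    for colors in coloring.values():
        if k > 1 and colors == all_colors:
            deleted += 1
        elif len(colors) > 1:
            return False
    return deleted <= b
```

For instance, a coloring in which "salt" holds all three colors and every other ingredient holds one color is feasible for Robust ECC with budget 1, and for Global ECC with budget 2.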
These models capture many different settings. Local ECC is arguably the most natural for academic co-authorship datasets, where each researcher can be associated with a small number of areas of expertise. Global ECC and Robust ECC are natural when many nodes can be associated with one category, but a few nodes should be associated with many categories (or transcend categories in the case of Robust ECC). For example, in clustering food ingredients, some ingredients are common in many cuisines (e.g., soy sauce or saffron) and some ingredients are so common they are in every category (e.g., salt). It is also worth noting that Global ECC is in some ways more flexible than Local ECC, since for a large enough budget $b$ one can always choose to give every node a small number of colors if that is the best fit for the dataset. As we shall see, this flexibility can sometimes lead to differences in theoretical results.
Each of these problems generalizes ECC, with equivalence for the first coming at $b = 1$ and for the latter two at $b = 0$. Consequently, they are each NP-hard [9], motivating the study of approximations and parameterized algorithms.
Approximations. In Section 4 we present an $r$-approximation on the number of edge mistakes for each objective, where $r$ is the maximum hyperedge size, using a greedy approach for assigning colors. Moreover, for Local ECC we present a simple LP-rounding scheme which results in a $(b + 1)$-approximate solution. In practice one might expect that flexibility should be allowed not just on the solution size, but also on the budget used to achieve the solution. We therefore also study bicriteria $(\alpha, \beta)$-approximations, where $\alpha$ is the approximation factor on edge mistakes and $\beta$ is the approximation factor on the budget $b$. For Local ECC and Robust ECC we present algorithms with constant $\alpha$ and $\beta$, while for Global ECC we give an algorithm with constant $\beta$ and $\alpha$ linear in $b$.
Parameterized algorithms. In Section 5 we study the decision variants, each of which asks whether it is possible to find a clustering which makes no more than $t$ mistakes. We give a complete characterization of the parameterized complexity of our problems in terms of solution size by showing that each is fixed-parameter tractable (FPT) in $t + b$, but W[1]-hard in $t$ and para-NP-hard in $b$. This implies (under standard assumptions) that they are not FPT with respect to $t$ or $b$ individually.
Empirical analysis. In Section 6 we present experimental results on datasets from the application domains motivated above, as well as several others. In practice, our approximation algorithms tend to significantly outperform their theoretical guarantees.

PRELIMINARIES
Let $H = (V, E, \ell)$ denote an edge-colored hypergraph with node set $V$ and colored (hyper)edge set $E$. Throughout the text, $k$ denotes the number of colors, and $r$ is the rank of the hypergraph, i.e., the maximum hyperedge size. The function $\ell : E \to [k] = \{1, 2, \ldots, k\}$ labels each edge with a color, and $E_c \subseteq E$ denotes the set of edges of color $c \in [k]$. For $v \in V$, we let $E(v) \subseteq E$ denote the set of edges incident to $v$ and $d_v = |E(v)|$ be its degree. We say that color $c$ is incident to node $v$ if $v$ is contained in at least one edge in $E_c$. The chromatic degree $d_v^\chi$ of $v$ is the number of colors incident to $v$.

The goal of all ECC problems we consider is to color nodes in a way that correlates as much as possible with edge colors, subject to different node-color constraints. To accommodate multiple color assignments, we consider maps $\lambda : V \to 2^{[k]}$ from nodes to the power set $2^{[k]}$ of colors, so $\lambda(v) \subseteq [k]$ represents the set of colors assigned to node $v$. Map $\lambda$ makes a mistake at $e \in E$ if there exists a node $v \in e$ such that $\ell(e) \notin \lambda(v)$, and we say the edge is unsatisfied. Otherwise the edge is satisfied by $\lambda$. The set of edges where $\lambda$ makes a mistake is denoted $\mathcal{M}_\lambda \subseteq E$. Let $\mathbb{1}$ denote a binary indicator function so that $\mathbb{1}(e \in \mathcal{M}_\lambda)$ is 1 if $\lambda$ makes a mistake at $e$ and is 0 otherwise. The optimization versions of Local ECC, Global ECC, and Robust ECC seek to minimize the number of mistakes,

$$\sum_{e \in E} \mathbb{1}(e \in \mathcal{M}_\lambda),$$

each subject to its own budget constraint on $\lambda$.

One can equivalently define all of these objectives as edge deletion problems, where the goal is to minimize the number of edges to delete in order to satisfy all remaining edges (subject to any constraints on $\lambda$). One can also define a maximization variant for all of these ECC objectives, where the goal is to maximize the number of satisfied edges rather than minimize the number of mistakes. This is equivalent at optimality but different in terms of approximations and parameterized complexity.
In this paper we focus on the mistake minimization objective.
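As an illustration of the mistake objective, the following minimal sketch lists the unsatisfied edges of a coloring. The edge representation (a pair of a node collection and an integer color) and the function name are assumptions made for this example.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Iterable[Hashable], int]  # (nodes of the hyperedge, its color)


def mistakes(edges: List[Edge], coloring: Dict[Hashable, Set[int]]) -> List[Edge]:
    """Return every edge containing a node that lacks the edge's color."""
    return [
        (nodes, color)
        for nodes, color in edges
        # an unassigned node is treated as having the empty color set
        if any(color not in coloring.get(v, set()) for v in nodes)
    ]
```

The mistake minimization objective is then the length of the returned list, and the maximization variant counts the remaining edges.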

RELATED WORK
Angel et al. [9] introduced the maximization variant of ECC in graphs and showed that it is NP-hard in general but polynomial-time solvable when $k = 2$. The approximability of the maximization variant on graphs has attracted considerable interest [2-4]. Amburg et al. [5] generalized ECC to hypergraphs, and studied the approximability of the minimization variant. Veldt [37] improved the best known approximation factor for this variant to $\min\{2(1 - \frac{1}{k}), 2(1 - \frac{1}{r+1})\}$. Cai and Leung [14] showed that ECC in graphs is FPT with respect to both satisfied and unsatisfied edges. Kellerhals et al. [27] gave improved parameterized algorithms for both graphs and hypergraphs, and showed fixed-parameter (in)tractability for a variety of structural parameters.
ECC is related to Chromatic Correlation Clustering [7, 11, 28, 38], which generalizes Correlation Clustering [10] to edge-colored graphs. Other variants of Correlation Clustering allow overlapping clusters [8, 13, 31]. However, none of these apply simultaneously to (i) hypergraphs, (ii) categorical interactions, and (iii) overlapping clusters. Aboud [1] defined a Correlation Clustering variant in which any node can be discarded for a price given by a penalty function mapping $V \to \mathbb{R}$. Devvrit et al. [18] studied the case of a constant penalty function and added a budget for the total number of discarded nodes, providing a bicriteria approximation. We adopt their notion of robustness, but emphasize that the resulting algorithms and complexity analysis are distinct, because Correlation Clustering and ECC have fundamentally different clustering objectives. More generally, notions of robustness have been introduced in a variety of clustering frameworks [15, 16, 30], though to our knowledge never in combination with edge colors.
APPROXIMATION ALGORITHMS
We present approximation algorithms for overlapping and robust ECC, including greedy $r$-approximations for each objective and bicriteria approximations based on linear programming (LP).

Greedy algorithms
Our greedy $r$-approximation algorithms generalize the greedy algorithm for standard ECC developed by Amburg et al. [5]. We first consider an alternate linear penalty for hyperedge mistakes, where the penalty at an edge $e \in E$ equals the number of nodes in $e$ that are not assigned the color of $e$. Formally, if $\lambda(v) \subseteq [k]$ is the set of colors assigned to $v$, this penalty is given by

$$P(\lambda, e) = \sum_{v \in e} \mathbb{1}(\ell(e) \notin \lambda(v)).$$

For every $e \in E$ and node coloring $\lambda$, this linear penalty satisfies

$$\mathbb{1}(e \in \mathcal{M}_\lambda) \le P(\lambda, e) \le r \cdot \mathbb{1}(e \in \mathcal{M}_\lambda).$$

Thus, if we can compute $\hat{\lambda} = \operatorname{argmin}_{\lambda} \sum_{e \in E} P(\lambda, e)$, where we minimize over a desired class of coloring functions $\lambda$ (e.g., overlapping or robust), then $\hat{\lambda}$ will be within a factor $r$ of the optimal clustering for the standard edge penalty. This applies in the same way for all overlapping and robust variants. It remains to show that we can find this optimal $\hat{\lambda}$ in polynomial time using a greedy approach.
For node $v$ and edge $e \ni v$, if $\ell(e) \notin \lambda(v)$ we say there is a node-edge error at $(v, e)$. Each node in an edge contributes independently to the linear edge penalty, so the linear ECC objective function simply amounts to minimizing the number of node-edge errors:

$$\min_{\lambda} \sum_{v \in V} \sum_{e \in E(v)} \mathbb{1}(\ell(e) \notin \lambda(v)). \tag{4}$$

The greedy algorithm for minimizing this objective starts with an empty labeling $\lambda$ that assigns no colors to any node, and then iteratively adds node-color assignments to greedily maximize the number of node-edge errors that are fixed at each step. To formalize this, for node $v \in V$ and color $c$, let $d_c(v) = |E(v) \cap E_c|$ be the number of edges of color $c$ that $v$ belongs to. Let $\sigma_v$ be a permutation vector that arranges colors based on how often $v$ participates in them:

$$d_{\sigma_v[1]}(v) \ge d_{\sigma_v[2]}(v) \ge \cdots \ge d_{\sigma_v[k]}(v).$$

We think of $\sigma_v[i]$ as node $v$'s "$i$-th favorite color". Figure 1 gives pseudocode for the greedy procedure for Local ECC, Global ECC, and Robust ECC. The fact that each method optimizes the linear objective function given in (4) for each variant of overlapping or robust ECC follows from the fact that node-edge errors are independent across different nodes. If node $v$ is assigned $j$ colors, then these must be $v$'s top $j$ favorite colors, otherwise we could improve the objective by switching color assignments at that node. Furthermore, if we have a global budget of additional colors to assign (or a budget on the number of nodes that can be assigned all colors at once), it is optimal to choose nodes at each step for which additional color assignments lead to the greatest immediate improvement in the linear objective. We conclude with the following theorem.
Theorem 4.1. The greedy approach provides an $r$-approximation for the standard Local ECC, Global ECC, and Robust ECC objectives.
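Figure 1: Pseudocode for the greedy algorithms for Local ECC, Global ECC, and Robust ECC. The Global ECC procedure repeatedly finds the node whose next color choice fixes the most errors and adds that node's next favorite color to its color list; the Robust ECC procedure repeatedly finds the node whose deletion fixes the most errors.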
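The greedy procedures described in this section can be sketched in a few lines of Python. This is an illustrative reading of the text rather than the authors' implementation: the edge representation, the function names, the tie-breaking rule (smaller color index first), and the convention that a deleted node holds all colors 1 through k are assumptions.

```python
import heapq
from collections import Counter, defaultdict


def color_counts(edges):
    """For each node v, count the edges of each color that contain v."""
    counts = defaultdict(Counter)
    for nodes, color in edges:
        for v in nodes:
            counts[v][color] += 1
    return counts


def favorites(counter):
    """Colors at a node, most frequent first (ties broken by color index)."""
    return sorted(counter, key=lambda c: (-counter[c], c))


def greedy_local(edges, b):
    """Give every node its top b favorite colors."""
    return {v: set(favorites(cnt)[:b]) for v, cnt in color_counts(edges).items()}


def greedy_global(edges, b):
    """One free favorite color per node, then b extra assignments, each chosen
    to fix as many node-edge errors as possible."""
    counts = color_counts(edges)
    order = {v: favorites(cnt) for v, cnt in counts.items()}
    coloring = {v: set(order[v][:1]) for v in counts}
    # heap entries: (-errors fixed, tie-breaker, index of next favorite, node)
    heap = [(-counts[v][order[v][1]], tie, 1, v)
            for tie, v in enumerate(counts) if len(order[v]) > 1]
    heapq.heapify(heap)
    for _ in range(b):
        if not heap:
            break
        _, tie, i, v = heapq.heappop(heap)
        coloring[v].add(order[v][i])
        if i + 1 < len(order[v]):
            heapq.heappush(heap, (-counts[v][order[v][i + 1]], tie, i + 1, v))
    return coloring


def greedy_robust(edges, b, k):
    """One favorite color per node, then delete (give all k colors to) the
    b nodes whose deletion fixes the most node-edge errors."""
    counts = color_counts(edges)
    coloring = {v: set(favorites(cnt)[:1]) for v, cnt in counts.items()}
    gain = {v: sum(cnt.values()) - max(cnt.values()) for v, cnt in counts.items()}
    for v in sorted(gain, key=lambda u: -gain[u])[:b]:
        if gain[v] > 0:
            coloring[v] = set(range(1, k + 1))
    return coloring
```

Because the per-node gains are non-increasing along each node's list of favorite colors, always taking the largest available gain minimizes the number of node-edge errors for the given budget.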

Linear Programming Algorithms
A bicriteria $(\alpha, \beta)$-approximation algorithm is one that comes within a factor $\alpha$ of the optimal number of edge mistakes while violating budget constraints by a factor at most $\beta$. Our greedy algorithms satisfy this guarantee with $\alpha = r$ and $\beta = 1$ (i.e., no budget violation), but the dependence on $r$ in the approximation factor is not ideal. We now turn to algorithms based on linear programming relaxations, which have much better approximation factors, at the expense of going over budget constraints by a small amount.
Local ECC LP Algorithms. We present the following LP relaxation for Local ECC:

$$\begin{aligned}
\min\ & \sum_{e \in E} x_e \\
\text{s.t. } & \sum_{c=1}^{k} x_v^c \ge k - b && \forall v \in V, \\
& x_e \ge x_v^c && \forall c \in [k],\ e \in E_c,\ v \in e, \\
& x_v^c,\ x_e \in [0, 1] && \forall v \in V,\ c \in [k],\ e \in E.
\end{aligned} \tag{6}$$

If we replace constraints $x_v^c, x_e \in [0, 1]$ with the binary constraints $x_v^c, x_e \in \{0, 1\}$, then this becomes an integer linear program (ILP) that exactly encodes Local ECC. The variable $x_v^c$ can be thought of as the distance from node $v$ to color $c$, and the constraint $\sum_c x_v^c \ge k - b$ captures the fact that each node can be assigned at most $b$ colors. The LP relaxation can be solved in polynomial time, and its optimal solution provides a lower bound on the optimal Local ECC solution. We can round the fractional LP into a clustering to provide a range of bicriteria approximations that trade off in terms of budget violation and objective function approximation.

Theorem 4.2. Let $\{x_v^c, x_e : e \in E, v \in V\}$ denote optimal LP variables for the LP given in (6). Let $\rho \in (0, 1)$ be any threshold such that $b/\rho$ is an integer, and let $\lambda$ encode a coloring where node $v$ is assigned color $c$ if $x_v^c < 1 - \rho$. This coloring is a bicriteria $\left(\frac{1}{1-\rho}, \frac{1}{\rho} - \frac{1}{b}\right)$-approximation.

Proof. For each node $v \in V$, this rounding scheme will assign at most $b/\rho - 1 = b(1/\rho - 1/b)$ colors to each node, which implies the overlap constraint is violated by at most a factor $(1/\rho - 1/b)$. Assuming this is not true implies there is some node $v \in V$ such that $x_v^c < 1 - \rho$ for at least $b/\rho$ choices of the color $c$. Since $x_v^c \le 1$ for every $c \in [k]$, we reach the following contradiction to LP feasibility:

$$\sum_{c=1}^{k} x_v^c < \frac{b}{\rho}(1 - \rho) + \left(k - \frac{b}{\rho}\right) = k - b.$$

Making a mistake at $e \in E$ means some $v \in e$ was not assigned color $c = \ell(e)$, so $x_e \ge x_v^c \ge 1 - \rho$. Since the cost for each edge is within a factor $1/(1 - \rho)$ of the LP variable for this edge, the total cost of $\lambda$ is within a factor $1/(1 - \rho)$ of the LP lower bound. □

It is worth noting that if we set $\rho = b/(b + 1)$, this leads to a single-criteria $(b + 1)$-approximation algorithm for Local ECC that directly generalizes the 2-approximation for standard ECC (Local ECC with $b = 1$). Another significant choice of parameter is when $\rho$ is chosen to be a constant, in which case we obtain bicriteria $(\alpha, \beta)$-approximations where both $\alpha$ and $\beta$ are constants. For example, when $\rho = 1/2$, we have a $(2, 2 - 1/b)$-approximation.
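The threshold rounding behind the Local ECC result is simple enough to sketch directly. The snippet assumes the fractional LP solution has already been computed by some LP solver and is passed in as a dictionary keyed by (node, color); that input format and the function name are assumptions made for illustration.

```python
def round_local_lp(x, rho):
    """Assign color c to node v whenever the LP distance x[(v, c)] is below 1 - rho.

    x   : dict mapping (node, color) -> fractional value in [0, 1]
    rho : rounding threshold in (0, 1)
    """
    coloring = {}
    for (v, c), dist in x.items():
        coloring.setdefault(v, set())
        if dist < 1 - rho:
            coloring[v].add(c)
    return coloring
```

As a small usage example, with k = 4 colors and budget b = 2, the values (0.4, 0.4, 0.4, 0.8) at a node sum to k - b = 2, so they are feasible for the covering constraint. With threshold 1/2 the node receives three colors, which matches the bound b/ρ - 1 = 3 from the proof.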
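Global ECC LP Algorithms. For Global ECC we give the following LP relaxation:

$$\begin{aligned}
\min\ & \sum_{e \in E} x_e \\
\text{s.t. } & \sum_{c=1}^{k} x_v^c \ge k - 1 - b_v && \forall v \in V, \\
& \sum_{v \in V} b_v \le b, \\
& x_e \ge x_v^c && \forall c \in [k],\ e \in E_c,\ v \in e, \\
& x_v^c,\ x_e \in [0, 1],\ b_v \ge 0 && \forall v \in V,\ c \in [k],\ e \in E.
\end{aligned} \tag{7}$$

To see why this is a relaxation, note that if the $\{x_v^c, x_e\}$ variables were constrained to be binary, we would recover the ILP for Global ECC, where the variable $b_v$ is an integer encoding the number of additional colors assigned to node $v$ beyond its first color assignment. Solving the relaxation produces nonnegative $b_v$ values that are generally not integers. For our LP rounding procedure, we define the following function for $x \ge 0$ and threshold $\rho \in (0, 1)$:

$$\lfloor x \rceil_\rho = \begin{cases} \lceil x \rceil & \text{if } x - \lfloor x \rfloor \ge \rho, \\ \lfloor x \rfloor & \text{otherwise.} \end{cases}$$

The following useful properties hold for every $x \ge 0$:

$$x < \lfloor x \rceil_\rho + \rho, \tag{9}$$
$$\lfloor x \rceil_\rho \le x + 1 - \rho, \tag{10}$$
$$\lfloor x \rceil_\rho \le x / \rho. \tag{11}$$

Theorem 4.3. Let $\{x_v^c, x_e, b_v : e \in E, v \in V\}$ denote optimal LP variables for the LP given in (7). For a threshold $\rho \in (0, 1)$, let $\delta_v = \frac{1 - \rho}{\lfloor b_v \rceil_\rho + 2}$ for each $v \in V$, and let $\lambda$ encode a coloring where node $v \in V$ is assigned color $c \in [k]$ if $x_v^c < \delta_v$. This coloring is a bicriteria $\left(\frac{b + 3 - \rho}{1 - \rho}, \frac{1}{\rho}\right)$-approximation.

Proof. This rounding scheme will assign at most $\lfloor b_v \rceil_\rho + 1$ colors to node $v$. If not, then $x_v^c < \delta_v$ for at least $\lfloor b_v \rceil_\rho + 2$ choices of the color $c$, so using the inequality in (9) we reach a contradiction to the LP constraints:

$$\sum_{c=1}^{k} x_v^c < (\lfloor b_v \rceil_\rho + 2)\,\delta_v + k - \lfloor b_v \rceil_\rho - 2 = k - 1 - \rho - \lfloor b_v \rceil_\rho < k - 1 - b_v.$$

Therefore, since each node is assigned one color without using any of the budget $b$, property (11) shows this rounding scheme assigns an extra $\sum_{v} \lfloor b_v \rceil_\rho \le \frac{1}{\rho} \sum_{v} b_v \le \frac{1}{\rho} b$ color assignments, violating the budget by at most a factor $1/\rho$.

If we make a mistake at $e \in E$, then $x_e \ge x_v^c \ge \delta_v$ for some $v \in e$ with $c = \ell(e)$, so this mistake is within a factor $1/\delta_v$ of the LP lower bound for that edge. By property (10) and $b_v \le b$, we have $1/\delta_v = (\lfloor b_v \rceil_\rho + 2)/(1 - \rho) \le (b + 3 - \rho)/(1 - \rho)$. □

By setting $\rho = \frac{1}{2}$ in the above theorem, we obtain a bicriteria $(2b + 5, 2)$-approximation. We can also come arbitrarily close to satisfying budget constraints as long as we are willing to make the approximation to the objective function worse. In general, the fact that the overlap budget is shared by all nodes makes it more challenging to obtain good approximations for Global ECC. Unlike our LP algorithms for Local ECC, there is no choice of threshold $\rho$ that leads to a single-criteria approximation algorithm (i.e., no violation of the budget), nor a setting that produces a constant-constant bicriteria approximation for Global ECC. Exploring alternate rounding techniques and relaxations that achieve these goals is therefore a natural direction for future research.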
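The rounding function and the node-specific thresholds used for Global ECC can be sketched as follows. Two points are assumptions made for illustration: the helper rounds a value up exactly when its fractional part is at least ρ, and each node's threshold is (1 - ρ) divided by the rounded budget share plus 2. The input format and function names are also illustrative.

```python
import math
from fractions import Fraction


def round_at(x, rho):
    """Round x up if its fractional part is at least rho, otherwise down."""
    floor = math.floor(x)
    return floor + 1 if x - floor >= rho else floor


def round_global_lp(x, b_frac, rho):
    """Threshold rounding for a fractional Global ECC solution.

    x      : dict mapping (node, color) -> fractional value in [0, 1]
    b_frac : dict mapping node -> fractional budget share b_v >= 0
    rho    : rounding threshold in (0, 1)
    """
    coloring = {v: set() for v in b_frac}
    for (v, c), dist in x.items():
        # assumed node threshold: (1 - rho) / (rounded share + 2)
        delta = (1 - rho) / (round_at(b_frac[v], rho) + 2)
        if dist < delta:
            coloring[v].add(c)
    return coloring
```

Using exact fractions avoids floating-point edge cases at the threshold. For example, with k = 3, a budget share of 1/2 and values (0, 1/2, 1) at a node, the covering constraint holds with equality, the threshold is 1/6 for ρ = 1/2, and only the first color is assigned.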
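Robust ECC LP Algorithms. Finally, we present the following LP relaxation for Robust ECC:

$$\begin{aligned}
\min\ & \sum_{e \in E} x_e \\
\text{s.t. } & \sum_{c=1}^{k} x_v^c \ge k - 1 && \forall v \in V, \\
& x_v^c - z_v \le x_e && \forall c \in [k],\ e \in E_c,\ v \in e, \\
& \sum_{v \in V} z_v \le b, \\
& x_v^c,\ x_e,\ z_v \in [0, 1] && \forall v \in V,\ c \in [k],\ e \in E.
\end{aligned} \tag{12}$$

Similar to our LPs for overlapping ECC, constraining variables $\{x_v^c, x_e, z_v\}$ to be binary produces an ILP that exactly encodes Robust ECC. The variable $z_v$, when binary, encodes whether node $v$ should be deleted ($z_v = 1$) or not ($z_v = 0$). If we delete this node, then every edge $e \ni v$ is not violated even if $v$ is not assigned color $\ell(e)$, which is captured by the constraint $x_v^c - z_v \le x_e$.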
Theorem 4.4. Let $\{x_v^c, x_e, z_v : e \in E, v \in V\}$ denote optimal LP variables for the LP in (12). For $\rho \in (0, \frac{1}{2})$, let $\lambda$ encode a coloring where $v \in V$ is deleted if $z_v \ge \rho$ and is given color $c$ if $x_v^c < \frac{1}{2}$. This coloring is a bicriteria $\left(\frac{2}{1 - 2\rho}, \frac{1}{\rho}\right)$-approximation.

Proof. The fact that $v$ is given color $c$ only if $x_v^c < \frac{1}{2}$ implies that each node will be assigned at most one color. If $\lambda$ makes a mistake at an edge $e \in E$ with $c = \ell(e)$, this means that there is some $v \in e$ that is not deleted and is not given color $\ell(e)$, so

$$x_e \ge x_v^c - z_v > \frac{1}{2} - \rho = \frac{1 - 2\rho}{2}.$$

Thus, we can pay for all edge mistakes within a factor $2/(1 - 2\rho)$ of the LP lower bound. Define $\hat{z}_v = 1$ if we delete node $v$ (i.e., $z_v \ge \rho$) and let $\hat{z}_v = 0$ otherwise. Observe that $\hat{z}_v \le (1/\rho) z_v$, so the number of deleted nodes is $\sum_v \hat{z}_v \le \frac{1}{\rho} \sum_v z_v \le \frac{1}{\rho} b$, which proves the desired bound on the budget violation. □

If we set $\rho$ to a small constant, we can get constant-constant bicriteria approximation guarantees. For example, $\rho = 1/3$ produces a $(6, 3)$-approximation and $\rho = 1/4$ gives a $(4, 4)$-approximation. It is worth noting that it is actually impossible to obtain a single-criteria approximation for Robust ECC by rounding the LP relaxation. To see why, consider a hypergraph $H$ with four nodes $V = \{v_1, v_2, v_3, v_4\}$ and two hyperedges $e_1 = \{v_1, v_2, v_3\}$ and $e_2 = \{v_2, v_3, v_4\}$ with unique colors 1 and 2. When $b = 1$, the optimal Robust ECC solution makes one mistake, but the LP relaxation obtains an objective score of 0, for example by setting $z_{v_2} = z_{v_3} = \frac{1}{2}$, $x_{v_2}^c = x_{v_3}^c = \frac{1}{2}$ for both colors, $x_{v_1}^1 = x_{v_4}^2 = 0$, and $x_{v_1}^2 = x_{v_4}^1 = 1$. Because every labeling that respects the budget will make at least one mistake, the ratio between the number of mistakes and the LP lower bound is infinite.
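The integrality gap example above can be checked by brute force. The sketch below makes two assumptions for illustration: the Robust ECC LP requires, for each node, that its color distances sum to at least k - 1 and that the deletion variables sum to at most b (in addition to the edge constraint quoted in the text), and the particular fractional solution shown is one feasible choice, not necessarily the one the authors had in mind.

```python
from fractions import Fraction
from itertools import product

NODES = ["v1", "v2", "v3", "v4"]
EDGES = [(("v1", "v2", "v3"), 1), (("v2", "v3", "v4"), 2)]
K, B = 2, 1


def best_integral(nodes, edges, k, b):
    """Fewest mistakes over all labelings; label 0 means the node is deleted."""
    best = len(edges)
    for choice in product(range(k + 1), repeat=len(nodes)):
        if choice.count(0) > b:
            continue  # more deletions than the budget allows
        label = dict(zip(nodes, choice))
        wrong = sum(any(label[v] not in (0, color) for v in e) for e, color in edges)
        best = min(best, wrong)
    return best


def lp_value(edges, x, z, k, b):
    """Objective of a fractional solution after checking the assumed constraints."""
    assert all(sum(x[v].values()) >= k - 1 for v in x)
    assert sum(z.values()) <= b
    # each x_e is set to the smallest value allowed by x_v^c - z_v <= x_e
    return sum(max([0] + [x[v][color] - z[v] for v in e]) for e, color in edges)


half = Fraction(1, 2)
X = {"v1": {1: 0, 2: 1}, "v2": {1: half, 2: half},
     "v3": {1: half, 2: half}, "v4": {1: 1, 2: 0}}
Z = {"v1": 0, "v2": half, "v3": half, "v4": 0}

print(best_integral(NODES, EDGES, K, B), lp_value(EDGES, X, Z, K, B))
```

Both shared nodes would need to be deleted to satisfy the two edges simultaneously, which exceeds the budget of one, while the fractional solution splits the single deletion evenly between them.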

PARAMETERIZED COMPLEXITY
Our parameterized complexity results focus on the decision versions of Local ECC, Global ECC, and Robust ECC. The input in these cases is an edge-colored hypergraph $H = (V, E, \ell)$ and two non-negative integers $t$ and $b$. Local ECC asks whether there exists a subset $D \subseteq E$ of size $|D| = t$ such that every edge in $E \setminus D$ can be satisfied by assigning at most $b$ colors to each node. Decision versions for Global ECC and Robust ECC are defined similarly using their corresponding constraints on $\lambda$. We use a tuple $(H, t, b)$ to denote an instance of a decision problem.
All three problems are XP in $t$ (i.e., each admits an algorithm whose running time is polynomial in the input size for every fixed $t$). One such algorithm exhaustively tries every way to remove $t$ edges and checks if coloring (or removing for Robust ECC) the nodes to satisfy the remaining edges violates the budget.
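The exhaustive XP procedure can be sketched as follows. The function names and the string argument selecting the variant are assumptions made for illustration. The budget check relies on a simple observation: once the deleted edges are fixed, a node must hold every color incident to it in the remaining edges (or be deleted in Robust ECC).

```python
from itertools import combinations


def chromatic_degrees(edges):
    """Number of distinct colors incident to each node."""
    colors = {}
    for nodes, color in edges:
        for v in nodes:
            colors.setdefault(v, set()).add(color)
    return {v: len(cs) for v, cs in colors.items()}


def budget_needed(edges, variant):
    """Smallest budget that satisfies every edge in `edges`."""
    degs = chromatic_degrees(edges).values()
    if variant == "local":
        return max(degs, default=0)      # most colors needed at one node
    if variant == "global":
        return sum(d - 1 for d in degs)  # extra colors beyond the free one
    if variant == "robust":
        return sum(d > 1 for d in degs)  # nodes that must be deleted
    raise ValueError(variant)


def decide(edges, t, b, variant):
    """Try every way to remove t edges; removing more edges never hurts."""
    size = min(t, len(edges))
    for removed in combinations(range(len(edges)), size):
        kept = [e for i, e in enumerate(edges) if i not in removed]
        if budget_needed(kept, variant) <= b:
            return True
    return False
```

The loop visits one subset per choice of t edges, so the running time is polynomial for each fixed t but grows with t in the exponent.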

W[1]-Hardness
Local ECC, Global ECC, and Robust ECC are all para-NP-hard (implying W[1]-hardness) with respect to $b$, since standard ECC is already NP-hard (i.e., $b = 0$ for Global ECC and Robust ECC, and $b = 1$ for Local ECC). This connection also rules out FPT algorithms in various other potentially interesting parameters, including $k$, $r$, max degree, vertex-cover number, and treewidth [27]. Here, we show that each of our problems is W[1]-hard in the natural parameter $t$. For Local ECC, we give a reduction from Set Cover, which is W[2]-hard with respect to the size of the cover [19]. W[2]-hardness implies W[1]-hardness by definition [19].

Proof. Given an instance $(\mathcal{U}, \mathcal{F}, t)$ of Set Cover, let $(H, t, b = |\mathcal{F}| - 1)$ be the following instance of Local ECC. For each element $u_i \in \mathcal{U}$, add the node $v_i$ to $V$. For each set $S_j \in \mathcal{F}$, add the node $s_j$ to $V$. Additionally, add the hyperedge $e_j$ with color $j$ to $E$ containing $s_j$ and the vertices $\{v_i : u_i \in S_j\}$. Finally, for each node in $\{v_i : u_i \notin S_j\}$, add a hyperedge containing $v_i$ and $s_j$ with color $j$. Note that each node $v_i$ participates in $|\mathcal{F}|$ hyperedges with $|\mathcal{F}|$ different colors. We argue that $(H, t, b)$ is a yes-instance of Local ECC if and only if $(\mathcal{U}, \mathcal{F}, t)$ is a yes-instance of Set Cover.
First, assume that $(\mathcal{U}, \mathcal{F}, t)$ is a yes-instance of Set Cover, and let $C \subseteq \mathcal{F}$ be a feasible solution. We claim that $D = \{e_j : S_j \in C\}$ is a feasible solution for $(H, t, b)$. By construction, $|D| \le t$. Consider whether the vertices of $H$ can be assigned colors to satisfy the remaining hyperedges. Each node $s_j$ is only contained in hyperedges of color $j$ and thus only needs one color. Since $C$ is a feasible set cover, each node $v_i$ must be contained by a hyperedge in $D$. Thus, $v_i$ participates in at most $|\mathcal{F}| - 1$ remaining hyperedges and only needs $b = |\mathcal{F}| - 1$ colors. As a result, $D$ is a feasible solution, and $(H, t, b)$ is a yes-instance of Local ECC.

Now, assume that $(H, t, b)$ is a yes-instance of Local ECC, and let $D$ be a feasible solution. Let $C = \{S_j : e_j \in D\}$. For the remaining hyperedges in $D$ of the form $\{v_i, s_j\}$, add any $S_{j'}$ to $C$ such that $u_i \in S_{j'}$. By construction, $|C| \le t$. Suppose that $C$ is not a set cover, and there exists an uncovered element $u_i \in \mathcal{U}$. By our choice of $C$, no hyperedges containing $v_i$ appear in $D$. However, since $v_i$ participates in $|\mathcal{F}|$ hyperedges with $|\mathcal{F}|$ different colors, assigning only $b$ colors to $v_i$ cannot satisfy all incident hyperedges. This contradicts that $D$ was a feasible solution, and so $C$ must also be feasible. Thus, $(\mathcal{U}, \mathcal{F}, t)$ is a yes-instance of Set Cover. □

To prove W[1]-hardness for Global ECC and Robust ECC, we give a reduction from Partial Vertex Cover, which is W[1]-hard with respect to the size of the cover [23].

Partial Vertex Cover
Input: a graph $G$ and non-negative integers $t$ and $s$.
Problem: is there a set $C$ of at most $t$ vertices which covers at least $s$ edges?

Theorem 5.2. Global ECC is W[1]-hard with respect to $t$.
Proof. Given an instance $(G, t, s)$ of Partial Vertex Cover with $m$ edges, construct a Global ECC instance $(H, t, b = m - s)$ by mapping edges to nodes and vertices to hyperedges. For each edge $f$ of $G$, add the node $v_f$ to $V$. For each vertex $u$ of $G$, add the hyperedge $e_u$ containing the nodes $v_f$ corresponding to edges $f$ incident to $u$ in $G$. Each hyperedge is given a unique color. Note that each node $v_f$ participates in exactly two hyperedges. We argue that $(H, t, b)$ is a yes-instance of Global ECC if and only if $(G, t, s)$ is a yes-instance of Partial Vertex Cover.
First, assume $(G, t, s)$ is a yes-instance of Partial Vertex Cover, and let $C$ be a feasible solution. We claim that $D = \{e_u : u \in C\}$ is a feasible solution for $(H, t, b)$. By construction, $|D| \le t$. In $H \setminus D$, a node $v_f$ only needs more than one color if the edge $f$ is uncovered by $C$ in $G$. Since $C$ covers at least $s$ edges, there are at most $b = m - s$ of these nodes. Moreover, since each node needs at most two colors, we need at most $b$ additional color assignments to satisfy every hyperedge. Therefore, $D$ is a feasible solution, and $(H, t, b)$ is a yes-instance of Global ECC.

Now, assume $(H, t, b)$ is a yes-instance of Global ECC, and let $D$ be a feasible solution. Let $C = \{u : e_u \in D\}$. By construction, $|C| \le t$. Suppose that $C$ is not a feasible partial vertex cover, and there exists a set $F$ of $m - s + 1$ uncovered edges in $G$. For every edge $f = (u, w) \in F$, neither $e_u$ nor $e_w$ are in $D$. Thus, $v_f$ appears in hyperedges with two different colors, and an additional color assignment is necessary to satisfy both $e_u$ and $e_w$. However, then we must use a global budget of at least $m - s + 1 > b$ colors, contradicting that $D$ is a feasible solution. Therefore, $C$ covers at least $s$ edges, and $(G, t, s)$ is a yes-instance of Partial Vertex Cover. □

Theorem 5.3. Robust ECC is W[1]-hard with respect to $t$.
The proof proceeds identically to Theorem 5.2 since removing a node has the same effect as assigning a second color.
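The construction used in the reduction from Partial Vertex Cover is easy to state in code. The sketch below builds the hypergraph in which graph edges become nodes and graph vertices become uniquely colored hyperedges. The function and parameter names are assumptions made for illustration.

```python
def pvc_to_global_ecc(graph_edges, cover_size, min_covered):
    """Build the ECC instance from a Partial Vertex Cover instance.

    graph_edges : list of 2-tuples (u, w), the edges of the input graph
    cover_size  : allowed number of cover vertices (becomes the mistake bound)
    min_covered : number of edges that must be covered
    Returns (hyperedges, mistake_bound, budget).
    """
    vertices = sorted({u for edge in graph_edges for u in edge})
    color_of = {u: i + 1 for i, u in enumerate(vertices)}  # unique color per vertex
    hyperedges = [
        (frozenset(edge for edge in graph_edges if u in edge), color_of[u])
        for u in vertices
    ]
    budget = len(graph_edges) - min_covered
    return hyperedges, cover_size, budget
```

Each graph edge appears in exactly two hyperedges (one per endpoint), which is the property the proof relies on.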

FPT Algorithms
Since Local ECC, Global ECC, and Robust ECC are not FPT with respect to either $t$ or $b$ individually, we now consider parameterizing by $t + b$. We give branching algorithms which work by defining and resolving conflicts: nodes which cannot satisfy all of their incident hyperedges without deleting a hyperedge or using an extra color.

Theorem 5.4. Local ECC is FPT with respect to $t + b$.

Proof. Given a Local ECC instance $(H, t, b)$, a conflict is a node $v$ and $b + 1$ incident hyperedges with pairwise distinct colors. If there are no conflicts, all edges can be satisfied since each node needs at most $b$ colors to satisfy all incident hyperedges. If $(H, t, b)$ contains a conflict, we branch on the $b + 1$ possible deletions to resolve it. Each branch increases the number of deleted edges by one, so the search tree has depth at most $t$. Conflicts can be found in $O(r|E|)$ time by checking the set of incident hyperedges at each node. Thus, the algorithm runs in $O((b + 1)^t r|E|)$ time, which is FPT in $t + b$. □

Theorem 5.5. Global ECC is FPT with respect to $t + b$.
Proof. Given an instance $(H, t, b)$ of Global ECC, recall that $\lambda(v)$ denotes the set of colors assigned to a node $v$. We define a conflict to be a node $v$ and the incident hyperedges $e_1$ of color $c_1$ and $e_2$ of color $c_2$ such that $c_1 \ne c_2$ and $c_1, c_2 \notin \lambda(v)$. First, we note that in an instance $(H, t, b)$ partially colored by $\lambda$ with no conflicts, any unsatisfied hyperedges must be monochromatic at each node. Thus, if $\lambda$ assigns at most $b$ colors, we can satisfy the remaining edges in $H$ in linear time by assigning one (free) color to each node.
If $(H, t, b)$ does contain a conflict, we branch on the four possible ways to resolve it: assign the color $c_1$ to $v$, assign the color $c_2$ to $v$, delete $e_1$, or delete $e_2$. All four branches decrease either $t$ or $b$, and so the maximum depth of the search tree is at most $t + b$. Moreover, we can find conflicts in $O(r|E|)$ time by checking the set of incident hyperedges at each node. Thus, the algorithm runs in $O(4^{t+b} r|E|)$ time, which is FPT in $t + b$. □

Theorem 5.6. Robust ECC is FPT with respect to $t + b$.
The FPT algorithm for Robust ECC proceeds in the same manner as Global ECC in Theorem 5.5, the main difference being that a node deletion potentially resolves more conflicts than assigning a single additional color.
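The branching scheme described for Local ECC (find a node with more than b incident colors, then branch on which of b + 1 differently colored incident edges to delete) can be sketched as a short recursion. The function names and the list-of-pairs edge representation are assumptions made for illustration.

```python
def find_conflict(edges, b):
    """Return indices of b + 1 edges with distinct colors sharing a node, or None."""
    by_node = {}
    for i, (nodes, color) in enumerate(edges):
        for v in nodes:
            # remember one representative edge per (node, color) pair
            by_node.setdefault(v, {}).setdefault(color, i)
    for representatives in by_node.values():
        if len(representatives) > b:
            return list(representatives.values())[: b + 1]
    return None


def local_ecc_decide(edges, t, b):
    """Decide whether deleting at most t edges leaves every node with <= b colors."""
    conflict = find_conflict(edges, b)
    if conflict is None:
        return True   # every node can take all of its incident colors
    if t == 0:
        return False  # a conflict remains but no deletions are left
    # at least one of the b + 1 conflicting edges must be deleted
    return any(
        local_ecc_decide(edges[:i] + edges[i + 1:], t - 1, b) for i in conflict
    )
```

The recursion depth is at most t and each call branches at most b + 1 ways, matching the search tree size in the analysis.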

Kernelization
Finally, we observe that a single reduction rule leads to a kernel for each variant. Specifically, we need only remove easy nodes. A node $v$ is easy if it can be colored without conflict (i.e., only in hyperedges of one color for Global ECC/Robust ECC or at most $b$ colors for Local ECC). Note that removing easy vertices may result in hyperedges with only one node.
Theorem 5.7. Local ECC admits a kernel with $tr$ vertices. Global ECC and Robust ECC admit a kernel with $tr + b$ vertices.
Proof. First, remove all easy nodes. Since deleting a hyperedge of size $r$ resolves conflicts in at most $r$ nodes, deleting $t$ edges resolves conflicts in at most $tr$ nodes. In Global ECC and Robust ECC, conflicts in an additional $b$ nodes can be resolved by using the global budget to assign an additional color or delete a node. Since we must resolve at least one conflict in every node after removing easy nodes, yes-instances have a bounded size. □

From here, multiple brute force algorithms yield FPT results of varying quality depending on the number of colors, the maximum hyperedge size $r$, and the value of $t + b$.
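The reduction rule can be sketched as a single pass over the hypergraph. The function name and the `threshold` parameter (the local budget for Local ECC, or 1 for Global ECC and Robust ECC) are assumptions made for illustration, as is the choice to drop hyperedges that become empty, since such edges contain only easy nodes and can always be satisfied.

```python
def remove_easy_nodes(edges, threshold):
    """Remove nodes incident to at most `threshold` colors from every hyperedge."""
    colors = {}
    for nodes, color in edges:
        for v in nodes:
            colors.setdefault(v, set()).add(color)
    hard = {v for v, cs in colors.items() if len(cs) > threshold}
    reduced = [(frozenset(nodes) & hard, color) for nodes, color in edges]
    # hyperedges with no remaining nodes are satisfiable and can be discarded
    return [(nodes, color) for nodes, color in reduced if nodes]
```

Removing a node does not change the set of colors incident to any other node, so one pass suffices.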

EXPERIMENTS
We evaluate the approximation algorithms developed in Section 4 on a corpus of six real-world datasets. These have previously served as benchmarks for Edge-Colored Clustering algorithms [5, 37], and capture many motivating settings for overlapping categorical clustering (e.g., co-authorship datasets and food ingredient datasets). Though we do not have optimal cluster assignments, we are able to upper bound the approximation ratios via comparison to the optimal LP objective scores. Because our algorithms are the first of their kind, there are no competitors against which to compare. The exception is for choices of $b$ which recover non-overlapping ECC, in which cases we observe consistency with prior benchmarking results [5, 37]. Otherwise, these experiments demonstrate that our approximation algorithms can produce optimal or near-optimal solutions to their (NP-hard) objectives, oftentimes with very low runtimes. We then discuss several notable experimental outcomes, building intuition for the ways in which (hyper)graph structure affects Edge-Colored Clustering algorithm performance. Finally, we compare our two models for Overlapping ECC. Our code and datasets are available at https://github.com/TheoryInPractice/overlapping-ecc.
Datasets. Our datasets are summarized in Table 1. Brain [17] is a graph in which nodes represent brain regions and edges represent relationships revealed by MRI scans. There are two edge colors: one for pairs of regions with high fMRI correlation and another for pairs with similar activation patterns. The Drug Abuse Warning Network (DAWN) [36] models drugs with nodes and patients with hyperedges indicating the combination of drugs taken prior to an emergency room visit. Edges are colored by the patient's emergency room results (e.g., "sent home" or "surgery"). The MAG-10 hypergraph is constructed from the Microsoft Academic Graph [35], with nodes representing authors, hyperedges capturing co-authorship groups, and edge colors representing computer science venues to which the groups have submitted papers. When the same group of co-authors has submitted to multiple venues, the most common color is chosen, and ties are discarded. The Cooking dataset [26] represents ingredients with nodes and recipes with hyperedges. Edge colors are used to signify cuisine type. In the Walmart dataset, nodes represent individual products, hyperedges capture groups of products that have been purchased together in a single shopping trip, and edge colors are categorical "trip type" labels assigned by Walmart [25]. Finally, in the Trivago dataset [29] nodes are vacation rental properties on the titular website, and edges represent groups of properties viewed by a single user during a single browsing session. Edge colors track the country from which the browsing session took place. Thus, the goal in the context of ECC is to use the resulting hypergraph to cluster properties according to the countries from which they are likely to attract renters.
Table 1: Summary statistics of datasets: number of nodes |V|; number of (hyper)edges |E|; number of edge colors; maximum and mean hyperedge size; and maximum and mean chromatic degree. Also, categorical edge clustering performance for the algorithms Local ECC LP (LLP), Local ECC Greedy (LG), Global ECC LP (GLP), Global ECC Greedy (GG), Robust ECC LP (RLP), and Robust ECC Greedy (RG). LLP was run with local budgets in {1, 2, 3, 4, 5, 8, 16, 32}. GLP was run with global budgets whose ratio to |V| is in {0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4}. RLP was run with deletion budgets whose ratio to |V| is in {0, .01, .05, .1, .15, .2, .25}. Performance is listed in terms of the approximation guarantee given by the LP lower bound (lower is better) and each listed value is the maximum (worst) for the given algorithm across all tested budgets, rounded up to three decimal places.
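The color rule used in the MAG-10 construction (most common venue wins, ties are discarded) can be sketched as follows. This is a minimal illustration and not the authors' preprocessing code: the function names and the input format (pairs of author group and venue) are hypothetical, and we assume that "discarded" means the tied group contributes no hyperedge.

```python
from collections import Counter

def resolve_color(venues):
    """Return the unique most common venue, or None when the top count is tied."""
    top = Counter(venues).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie between the two most common venues: discard
    return top[0][0]

def build_colored_hyperedges(submissions):
    """submissions: iterable of (author_group, venue) pairs (hypothetical format).

    Returns {frozenset(author_group): color} with tied groups dropped.
    """
    venues_by_group = {}
    for group, venue in submissions:
        venues_by_group.setdefault(frozenset(group), []).append(venue)
    edges = {}
    for group, venues in venues_by_group.items():
        color = resolve_color(venues)
        if color is not None:
            edges[group] = color
    return edges

example = [
    (("x", "y"), "WSDM"),
    (("y", "x"), "WSDM"),
    (("x", "y"), "SIGCOMM"),
    (("z",), "WSDM"),
    (("z",), "SIGCOMM"),
]
colored = build_colored_hyperedges(example)
```

In this toy input the group {x, y} keeps the color WSDM, while the group {z} is dropped because its two venues are tied.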

Algorithm Performance
We evaluate each of the six algorithms developed in Section 4, with particular emphasis on the bicriteria LP-rounding algorithms described by Theorems 4.2, 4.3, and 4.4. We refer to these as Local ECC LP (LLP), Global ECC LP (GLP), and Robust ECC LP (RLP), respectively. We select parameters to produce small approximation guarantees on both the edge mistake and budget objectives (constant-constant where possible). Specifically, we test LLP with parameter 1/2, GLP with parameter 1/2, and RLP with parameter 1/3, guaranteeing (edge mistake, budget) approximation factors of (2, 2 − 1/b), (2b + 5, 2), and (6, 3), respectively, where b denotes the budget of the model in question. We test with a variety of values for the budget, in each case selected to be representative of the practically useful parameter space. Table 1 reports how well each algorithm approximates its objective. The maximum edge mistake factor is the ratio between the number of mistakes made by a given algorithm and the lower bound provided by the LP relaxation, maximized across all tested budgets. The maximum budget factor is the approximation factor on the budget, once again maximized across all tested budgets. We find that all three LP-rounding algorithms drastically outperform their theoretical guarantees. In particular, Figures 2a and 2b show that both Overlapping ECC algorithms provide nearly perfect performance on the edge mistake objective. Indeed, both LPs frequently find optimal (integral) solutions, leaving no rounding to do. Though we generally understand our objective as minimizing edge mistakes, we can also consider maximizing the number of edge satisfactions. These objectives are equivalent (though they differ in terms of approximations), and just as the LP objective scores provide a lower bound on edge mistakes, we can derive an upper bound on edge satisfactions by subtracting the LP objective scores from |E|. Figures 2c and 2d show that both LLP and GLP produce satisfied edge sets with size at least 99.8% of optimal. We note that these observed approximation factors are especially impressive in the GLP context, where the theoretical guarantee is not constant but rather a linear function in the budget. Figure 2b shows that we do not observe this linear relationship in practice, suggesting that our aforementioned interest in further study of rounding techniques for the Global ECC LP relaxation may bear fruit.
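The two LP-based bounds used above reduce to simple arithmetic, sketched below. This is an illustration only: the function names and the example numbers are hypothetical and are not taken from the experiments, and returning a ratio of 1 when both the LP score and the mistake count are zero is a convention we assume here.

```python
import math

def observed_mistake_ratio(mistakes, lp_score):
    """Edge mistakes made by an algorithm divided by the LP lower bound."""
    if lp_score == 0:
        # Assumed convention: a zero lower bound is matched only by zero mistakes.
        return 1.0 if mistakes == 0 else math.inf
    return mistakes / lp_score

def satisfied_fraction_of_upper_bound(mistakes, lp_score, num_edges):
    """Satisfied edges as a fraction of the upper bound |E| - LP score."""
    satisfied = num_edges - mistakes
    upper_bound = num_edges - lp_score
    if upper_bound == 0:
        return 1.0  # no edge can be satisfied, so zero satisfied edges is optimal
    return satisfied / upper_bound

# Hypothetical instance: 1000 edges, LP objective 100.0, rounded solution with 102 mistakes.
ratio = observed_mistake_ratio(102, 100.0)
fraction = satisfied_fraction_of_upper_bound(102, 100.0, 1000)
```

Because the LP score never exceeds the optimal number of mistakes, the first quantity upper bounds the true approximation ratio, and the second lower bounds the fraction of optimally satisfiable edges that the algorithm satisfies.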
Even more remarkable is the performance of these algorithms on the budget objective. Our empirical results show that, in nearly every case, our bicriteria Overlapping ECC algorithms functioned as single-criteria approximations. Indeed, we were unable to produce a single case in which GLP exceeded its color assignment budget on real-world data. LLP performed nearly as well, producing within-budget approximations for all tested budgets on 5 out of 6 datasets. The Trivago dataset is the lone exception, though Figure 3c shows that even in this case the algorithm outperformed its guarantee for most budgets.
RLP does not achieve the near-exactness of the overlapping variants, but it still significantly outperforms its guarantees. Figures 3a and 3b show that across all datasets and budgets, we never observe an edge mistake factor above 1.25 or a budget factor above 2. In fact, in the Cooking and DAWN datasets we see that exceeding the node deletion budget tends to allow for clusterings with fewer edge mistakes than would otherwise be possible. We discuss this in more detail in Section 6.2.
We conclude this section by highlighting the runtime performance of our algorithms. Table 2 gives runtimes for each of our greedy and LP-rounding algorithms. Given their simplicity, it is not surprising that each of our greedy algorithms is extremely fast. The LP-rounding algorithms have more surprising results. Both LLP and GLP produce optimal or very nearly optimal solutions very quickly on a standard workstation with 64 GB of RAM, the sort of machine widely available at nearly every university or private company. Indeed, even with hundreds of thousands of nodes and hyperedges, LLP never requires more than ∼6 minutes, and GLP never requires more than ∼11 minutes. On the same machine, RLP generally has similar runtime performance. Walmart is the exception, but even in this case RLP requires only ∼2 hours.

Discussion
In this section we present several case studies as a means to answering the following question: "what questions should we ask about the structure of our data to understand how ECC algorithms will perform?" We analyze performance in terms of both the problem objectives and a related notion of cluster quality. Lastly, we compare the clusters generated by our two Overlapping ECC models.
Understanding RLP Edge Mistake Factors. Figure 3a raises the following question: why is RLP able to outperform the LP lower bound for Cooking and DAWN, but not for other datasets? The answer lies in the structure of the data. A first hint comes from Table 1: the mean chromatic degrees in these two datasets are notably higher than in all others. A more detailed explanation requires new machinery. We define the non-dominant degree of a vertex v as the number of (hyper)edges containing v whose color is not v's most frequent edge color. Formally, recalling the notation of Section 4.1, the non-dominant degree of v is the degree of v minus the number of edges containing v that have v's most frequent color. The non-dominant degree percentage of v is the quotient of its non-dominant degree and its degree. Robust ECC can only assign one color to each node, so we should expect that the set of deleted nodes will tend to have high non-dominant degree and non-dominant degree percentage. In fact, the former is precisely the greedy criterion used by RG, and Figure 3d shows that increasing the deletion budget is especially helpful for RG on Cooking. Table 3 gives more detail: Cooking and DAWN are outliers in that relatively many of their nodes have degree well-spread across multiple colors.
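The non-dominant degree statistics can be computed directly from an edge-colored hypergraph. The sketch below is illustrative rather than the authors' implementation: the input format (pairs of node collection and color), the function names, and the alphabetical tie-breaking in the ranking are all assumptions made here.

```python
from collections import Counter, defaultdict

def non_dominant_stats(edges):
    """edges: iterable of (nodes, color) pairs.

    Returns {node: (non_dominant_degree, non_dominant_fraction)}.
    """
    color_counts = defaultdict(Counter)
    for nodes, color in edges:
        for v in set(nodes):
            color_counts[v][color] += 1
    stats = {}
    for v, counts in color_counts.items():
        degree = sum(counts.values())
        dominant = max(counts.values())  # edges in v's most frequent color
        non_dominant = degree - dominant
        stats[v] = (non_dominant, non_dominant / degree)
    return stats

def rank_deletion_candidates(edges, budget):
    """Rank nodes by non-dominant degree, the criterion the text attributes to RG.

    Ties are broken alphabetically here; that choice is an assumption.
    """
    stats = non_dominant_stats(edges)
    ranked = sorted(stats, key=lambda v: (-stats[v][0], str(v)))
    return ranked[:budget]

toy_edges = [
    (("a", "b"), "red"),
    (("a", "c"), "red"),
    (("a", "d"), "blue"),
    (("b", "c"), "blue"),
]
toy_stats = non_dominant_stats(toy_edges)
toy_candidates = rank_deletion_candidates(toy_edges, 1)
```

In the toy hypergraph, node a has degree 3 with two red edges, so its non-dominant degree is 1 and its non-dominant fraction is 1/3, while node d lies in a single edge and has non-dominant degree 0.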
Though we have presented this information formally, an intuitive understanding of the data should be sufficient to predict this structural difference. It makes sense that many food ingredients are used widely across multiple cuisines, and that many drugs lead commonly to multiple emergency-room outcomes, based on their common combinations. Conversely, we should expect that vacation rentals, while perhaps receiving interest from multiple countries, attract the vast majority of their interest from their own country. We note that Brain is an interesting case as well. Table 3 shows that it has the least heavy-tailed distribution of non-dominant degree. Additionally, it is an outlier among our datasets in that it is our only (non-hyper) graph, it has only two colors, and it has by far the highest edge density. Further study is needed to understand how the structure of Brain impacts algorithm performance.
Cluster Quality: LLP vs LG. Table 1 tells us that, in terms of the edge mistake factor, the worst-case performances of our LP-rounding algorithms are always better than those of their greedy counterparts. In fact we can say something stronger: across all of our datasets and both Overlapping ECC models, the LP-rounding algorithms always minimized edge mistakes at least as well as their greedy counterparts, oftentimes by wide margins. Counting edge mistakes, however, is not the only way to measure cluster quality. We now ask whether our LP-rounding algorithms maintain their superior performance when evaluated by other metrics, using LLP and LG as an example case. To this end, we call a node v unused by a clustering if it participates in zero satisfied edges. A low number of unused nodes indicates that a method has produced dense clusters, i.e., clusters where members combine well. Figure 4c tells us that in addition to the edge mistake objective, LLP also outperforms LG with respect to cluster density, as indicated by unused nodes. We ask, however, why this difference in performance varies across datasets. Once again, basic intuition about the datasets and algorithms in question can help us predict the results. Recall that a hyperedge of color c is unsatisfied if any of its nodes is not colored c. Consequently, a greedy assignment of color c to node v has more chances to become irrelevant if the hyperedges of color c that contain v tend to be large. Figure 4c supports this intuition: LLP outperforms LG with respect to unused nodes especially well on the Cooking and Walmart datasets, which tend to have large hyperedges, while the performance gap is smaller on datasets with small hyperedges.
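The unused-node metric is straightforward to compute from a clustering. The following sketch assumes a hypothetical representation in which each node maps to its set of assigned colors; the function names and the toy data are ours and serve only to illustrate the definition above.

```python
def satisfied_edges(edges, coloring):
    """Return the edges whose color is assigned to every one of their nodes.

    edges: list of (nodes, color); coloring: {node: set of colors}.
    """
    return [
        (nodes, color)
        for nodes, color in edges
        if all(color in coloring.get(v, set()) for v in nodes)
    ]

def unused_nodes(edges, coloring):
    """Nodes that participate in zero satisfied edges."""
    all_nodes = {v for nodes, _ in edges for v in nodes}
    used = {v for nodes, _ in satisfied_edges(edges, coloring) for v in nodes}
    return all_nodes - used

toy_edges = [
    (("a", "b", "c"), "red"),
    (("a", "d"), "blue"),
    (("c", "d"), "blue"),
]
toy_coloring = {"a": {"red", "blue"}, "b": {"red"}, "c": {"blue"}, "d": {"blue"}}
toy_satisfied = satisfied_edges(toy_edges, toy_coloring)
toy_unused = unused_nodes(toy_edges, toy_coloring)
```

Here node b is colored red, but the only red hyperedge also contains c, which is not colored red, so the edge is unsatisfied and b is unused. This is the effect described above: the larger the hyperedge, the more ways a single color assignment can become irrelevant.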
Comparison of GLP and LLP. We conclude with a brief comparison of our two models for Overlapping ECC. If the budget is viewed as an average per node, then we can compare these models directly, i.e., a local budget b is compared to a global budget (b − 1) · |V|. Viewed in this way, Global ECC is a generalization of Local ECC, because for such budgets any solution to the latter is also feasible for the former. Thus, whenever budgets are aligned GLP will satisfy a larger percentage of edges (Figure 4a). We note that in real data the margins are quite large: oftentimes GLP satisfies an additional 10 to 40 percent or more of edges. Figure 4b emphasizes this point from an alternate perspective: whenever the budget allows for overlap, using GLP instead of LLP reduces the edge mistake objective score by (oftentimes much) more than 40%. We might also ask whether the increased freedom granted to Global ECC results in higher quality clusters by other measures. Figure 4d shows that GLP generally outperforms LLP with respect to unused nodes. It is encouraging that this holds even for a metric that we do not directly optimize for.
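The feasibility claim behind this comparison follows from a one-line calculation. As a sketch, write λ(v) for the set of colors assigned to node v (a symbol assumed here for illustration) and let b be the local budget, so that every node receives at most b colors. Summing the number of color assignments beyond the first over all nodes gives:

```latex
\sum_{v \in V} \bigl(|\lambda(v)| - 1\bigr)
  \;\le\; \sum_{v \in V} (b - 1)
  \;=\; (b - 1)\cdot |V| .
```

Hence any color map that respects the local budget b, and that gives every node at least one color, also respects a global budget of (b − 1) · |V|. The converse need not hold, since a global budget may be concentrated on a few nodes.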

CONCLUSION
We have addressed two shortcomings of existing categorical (hyper)graph clustering methods by presenting three generalizations of Edge-Colored Clustering, allowing for either overlapping clusters or robustness to noisy points. We have addressed the parameterized complexity of each problem, and provided greedy single-criteria and LP-rounding-based bicriteria approximation algorithms. Concrete questions remain. For example, are there constant-factor single-criteria approximations for Local ECC and Robust ECC, and is there a constant-constant bicriteria approximation for Global ECC? More generally, bicriteria inapproximability is a relatively unexplored topic, and it would be interesting to develop the theory of hardness in this area. From the perspective of parameterized algorithms, it is open to determine whether polynomial kernels exist for our problems in the parameter  + , as well as to identify measures of edge-colored hypergraph structure which can provably explain the real-world performance of our approximation algorithms.

Theorem 5.4.
Local ECC is FPT with respect to  + .

Figure 2: (a)-(b): Observed edge mistake factors for LLP and GLP, where the performance is evaluated against the LP lower bound. (c)-(d): Satisfied edge set sizes for LLP and GLP, presented as percentages of the upper bound derived from the LP.

Figure 3: (a)-(b): Observed RLP edge mistake and budget factors. Edge mistake factors less than 1 indicate that the rounded clustering has fewer edge mistakes than is possible without violating the node deletion budget, while budget factors less than 1 indicate that less than the full budgeted allotment of nodes were deleted. (c): Observed LLP budget factors for the Trivago dataset. The black line is the upper bound, 2 − 1/b, where b is the local budget. (d): Edge satisfaction percentages for RG.

Figure 4: (a): Absolute difference between GLP and LLP edge satisfaction percentages, with the same (average) budget per node. (b): Percent reduction (relative difference) in mistakes when using GLP instead of LLP. (c): Percent reduction in unused nodes when using LLP instead of LG. (d): Percent reduction in unused nodes when using GLP instead of LLP.
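subject to different constraints on the color map. Local ECC is restricted to maps that assign at most b colors to each node, where b is the local budget. Global ECC is restricted to maps that assign at least one color to every node v ∈ V and for which the number of colors assigned to v minus 1, summed over all v ∈ V, is at most the global budget. Robust ECC has the restriction that the number of nodes assigned the full color set is at most the deletion budget, while all other nodes v ∈ V have one color assignment.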
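The three constraints can be checked mechanically for a candidate color map. The sketch below is a minimal illustration under stated assumptions: the color map is represented as a hypothetical dictionary from nodes to sets of colors, the function names are ours, and we read the Robust ECC condition as "a deleted node is one assigned every color", which is our interpretation of the constraint.

```python
def is_local_feasible(coloring, budget):
    """Local ECC: every node receives at most `budget` colors."""
    return all(len(colors) <= budget for colors in coloring.values())

def is_global_feasible(coloring, budget):
    """Global ECC: every node has at least one color, and the total number of
    assignments beyond the first, summed over nodes, is at most `budget`."""
    sizes = [len(colors) for colors in coloring.values()]
    return all(s >= 1 for s in sizes) and sum(s - 1 for s in sizes) <= budget

def is_robust_feasible(coloring, budget, all_colors):
    """Robust ECC (assumed reading): at most `budget` nodes are assigned the
    full color set, and every other node has exactly one color."""
    full = set(all_colors)
    deleted = [v for v, colors in coloring.items() if set(colors) == full]
    others_single = all(
        len(colors) == 1 for colors in coloring.values() if set(colors) != full
    )
    return len(deleted) <= budget and others_single

toy_colors = {"red", "blue"}
toy_coloring = {"a": {"red", "blue"}, "b": {"red"}, "c": {"blue"}}
```

In this toy map, node a carries both colors, so the map is feasible for Local ECC with budget 2, for Global ECC with budget 1, and for Robust ECC with deletion budget 1, but infeasible for each model once the budget is lowered by one.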
Note that observed  values can exceed  because the LP provides a lower bound on the optimal number of edge mistakes. In the MAG-10 and Trivago cases, the LP found fractional solutions with objective score 0. RLP was able to create a rounded solution with no edge mistakes by exceeding the node deletion budget, but RG was not.

Table 2: Algorithm runtimes. Reported figures are the maximum across all tested budgets. Algorithms and budget values are as described in Table 1. Experiments were run on a machine with an Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (8 cores) and 64 GB of RAM.

Table 3: Network structure. For each dataset we present maximum, mean, and median non-dominant degree, as well as the fraction of nodes with chromatic degree > 1, and with non-dominant degree percentage at least 5% and 10%.