Efficient Computation of Subspace Skyline over Categorical Domains

Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, etc. The underlying datasets in such applications have numerous attributes that are mostly Boolean or categorical. Discovering the skyline of such datasets over a subset of attributes would identify entries that stand out while enabling numerous applications. Only a few algorithms are designed to compute the skyline over categorical attributes, and these are applicable only when the number of attributes is small. In this paper, we place the problem of skyline discovery over categorical attributes into perspective and design efficient algorithms for two cases. (i) In the absence of indices, we propose two algorithms, ST-S and ST-P, that exploit the categorical characteristics of the datasets by organizing tuples in a tree data structure that supports efficient dominance tests over the candidate set. (ii) We then consider the existence of widely used precomputed sorted lists. After discussing several approaches and studying their limitations, we propose TA-SKY, a novel threshold-style algorithm that utilizes sorted lists. We further optimize TA-SKY and explore its progressive nature, making it suitable for applications with strict interactive requirements. In addition to an extensive theoretical analysis of the proposed algorithms, we conduct a comprehensive experimental evaluation over a combination of real (including the entire AirBnB data collection) and synthetic datasets to study the practicality of the proposed algorithms. The results showcase the superior performance of our techniques, outperforming applicable approaches by orders of magnitude.


INTRODUCTION

Motivation
Skyline queries are widely used in applications involving multi-criteria decision making [12], and are further related to well-known problems such as top-k queries [13,1], preference collection [2], and nearest neighbor search [14]. Given a set of tuples, skylines are computed by considering the dominance relationships among them. A tuple p dominates another tuple q if q is not better than p in any dimension and p is better than q in at least one dimension. Moreover, a pair of tuples p and q are considered to be incomparable if neither p nor q dominates the other. The skyline is the set of tuples that are not dominated by any other tuple in the dataset [4].
In recent years, several applications have gained popularity in assisting users in tasks ranging from selecting a hotel in an area to locating a nearby restaurant. AirBnB, TripAdvisor, hotels.com, Craigslist, and Yelp are a few such examples. The underlying datasets have numerous attributes that are mostly Boolean or categorical. They also include a few numeric attributes, but in most cases the numeric attributes are discretized and transformed into categorical attributes [19]. For example, in the popular accommodation rental service AirBnB, the typical attributes are type and number of rooms, types of amenities offered, the number of occupants, etc. Table 1 shows a toy example that contains a subset of attributes present in AirBnB. Note that most of the attributes are amenities provided by the hosts (the temporary rental providers) and are primarily Boolean. The AirBnB dataset features more than 40 such attributes describing amenities users can choose. One way of identifying desirable hosts in such a dataset is to focus on the non-dominated hosts. This is because if a listing t dominates another listing t' (i.e., t is at least as good as t' on all the attributes while being better on at least one attribute), t should naturally be preferred over t'.
In the example shown in Table 1, "Host 1" and "Host 2" are in the skyline, while all the others are dominated by at least one of them. In real-world applications, especially when the number of attributes increases, users naturally tend to focus on a subset of attributes that is of interest to them. For example, during an AirBnB query, we typically consider a few attributes while searching for hosts that are in the skyline. For instance, in the dataset shown in Table 1, one user might be interested in Breakfast and Internet, while another user might focus on Internet, Cable TV, and Pool when searching for a host.
In this paper, we consider the problem of subspace skyline discovery over such datasets, in which, given an ad-hoc subset of attributes as a query, the goal is to identify the tuples in the skyline involving only those attributes. Such subspace skyline queries are an effective tool in assisting users in data exploration (e.g., an AirBnB customer can explore the returned skyline to narrow down to a preferred host).
In accordance with common practice in traditional database query processing, we design solutions for two important practical instances of this problem, namely: (a) assuming that no indices exist on the underlying dataset, and (b) assuming that indices exist on each individual attribute of the dataset. The space devoted to indices is a practical concern; given that the number of possible subset queries is exponential, we do not consider techniques that would construct indices for each possible subset, as that would impose an exponential storage overhead (not to mention the increased overhead of maintaining such indices under dynamic updates, as is typical in our scenario). Thus we explore a solution space in which index overhead ranges from zero to linear in the number of attributes, trading space for increased performance as numerous techniques in database query processing typically do [10,6,11].
To the best of our knowledge, LS [19] and Hexagon [22] are the only two algorithms designed to compute skylines over categorical attributes. Both of these algorithms operate by creating a lattice over the attributes in a skyline query, which is feasible only when the number of attributes is very small.

Technical Highlights
In this paper, we propose efficient algorithms to identify the answer for any subspace skyline query. Our main focus is to overcome the limitations of previous works ([19,22]), introducing efficient and scalable skyline algorithms for categorical datasets.
For the case when no indices are available, we design a tree structure to arrange the tuples in a "candidate skyline" set. The tree structure supports efficient dominance tests over the candidate set, thus reducing the overall cost of skyline computation. We then propose two novel algorithms called ST-S (Skyline using Tree Sorting-based) and ST-P (Skyline using Tree Partition-based) that incorporate the tree structure into existing sorting- and partition-based algorithms. Both ST-S and ST-P work when no index is available on the underlying datasets and deliver superior performance for any subset skyline query.
Then, we utilize precomputed sorted lists [8] and design efficient algorithms for the index-based version of our problem. As one of the main results of our paper, we propose the Threshold Algorithm for Skyline (TA-SKY), capable of answering subspace skyline queries. In the context of TA-SKY, we first start with a brief discussion of a few approaches that operate by constructing a full/partial lattice over the query space. However, these algorithms have a complexity that is exponential in the number of attributes involved in the skyline query. To overcome this limitation, we propose TA-SKY, an adaptation of the top-k threshold (TA) [8] style of processing for the subspace skyline problem. TA-SKY utilizes sorted lists and constructs the projection of the tuples in query space. This adaptation is novel because TA-style algorithms are traditionally utilized to solve top-k problems rather than skyline problems.
TA-SKY proceeds by accumulating information through sequential access over the indices, which enables it to stop early while guaranteeing that all skyline tuples have been identified. The early stopping condition enables TA-SKY to answer skyline queries without accessing all the tuples, thus reducing the total number of dominance checks and resulting in greater efficiency. Consequently, as further discussed in §6, TA-SKY demonstrates an order of magnitude speedup during our experiments. In addition to TA-SKY, we subsequently propose novel optimizations to make the algorithm even more efficient. TA-SKY is an online algorithm; it can output a subset of skyline tuples without discovering the entire skyline set. The progressive characteristic of TA-SKY makes it suitable for web applications with strict interactive requirements, where users want to get a subset of results very quickly. We study this property of TA-SKY in §6 on the entire AirBnB data collection, for which TA-SKY discovered more than two-thirds of the skyline in less than 3 seconds while accessing around 2% of the tuples, demonstrating the practical utility of our proposal.

Summary of Contributions
We propose a comprehensive set of algorithms for the subspace skyline discovery problem over categorical domains. The main contributions of this paper are summarized as follows:
• We present a novel tree data structure that supports efficient dominance tests over relations with categorical attributes.
• We propose the ST-S and ST-P algorithms that utilize the tree data structure for the subspace skyline discovery problem, in the absence of indices.
• We propose TA-SKY, an efficient algorithm for answering subspace skyline queries whose worst-case cost depends linearly on the number of attributes. The progressive characteristic of TA-SKY makes it suitable for interactive web applications. To our knowledge, this is the first adaptation of the TA style of processing to a skyline problem.
• We present a comprehensive theoretical analysis of the algorithms quantifying their performance analytically, and present the expected cost of each algorithm.
• We present the results of extensive experimental evaluations of the proposed algorithms over real-world and synthetic datasets at scale, showing the benefits of our proposals. In particular, in all cases considered we demonstrate that the performance benefits of our approach are extremely large (in most cases by orders of magnitude) when compared to other applicable approaches.

PRELIMINARIES
Consider a relation D with n tuples and m + 1 attributes. One of the attributes is tupleID, which has a unique value for each tuple. Let the remaining m categorical attributes be A = {A1, . . ., Am}. Let Dom(•) be a function that returns the domain of one or more attributes. For example, Dom(Ai) represents the domain of Ai, while Dom(A) represents the Cartesian product of the domains of attributes in A. |Dom(Ai)| represents the cardinality of Dom(Ai). We use t[Ai] to denote the value of t on the attribute Ai. We also assume that for each attribute, the values in the domain have a total ordering by preference (we shall use overloaded notation such as a > b to indicate that value a is preferred over value b).

Skyline
We now define the notions of dominance and skyline [4] formally. For each tuple t ∈ D, we shall also be interested in computing its score value, denoted by score(t), using a monotonic function F(•).

Definition 1. (Dominance
preferred to t on any attribute in Q while t is preferred to t' on at least one attribute in Q.
Definition 3. (Subspace Skyline). Given a subspace Q, the Subspace Skyline, SQ, is the set of tuples in DQ that are not dominated by any other tuple, i.e., SQ = {t ∈ DQ | there is no t' ∈ DQ such that t' dominates t on Q}.
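The dominance and subspace skyline definitions can be illustrated with a brute-force reference sketch. It is not one of the algorithms proposed in this paper; it only restates the definitions, assuming that larger attribute values are preferred. The listing data, the `Row` alias, and the function names are hypothetical and do not reproduce Table 1.

```python
from typing import Dict, List, Sequence

Row = Dict[str, int]  # attribute name -> value; larger values are assumed to be preferred


def dominates(p: Row, q: Row, Q: Sequence[str]) -> bool:
    """Return True if p dominates q on the attributes in Q."""
    no_worse = all(p[a] >= q[a] for a in Q)
    strictly_better = any(p[a] > q[a] for a in Q)
    return no_worse and strictly_better


def subspace_skyline(D: List[Row], Q: Sequence[str]) -> List[Row]:
    """Brute-force subspace skyline: keep tuples that no other tuple dominates on Q."""
    return [t for t in D if not any(dominates(u, t, Q) for u in D)]


# Hypothetical listings, made up for illustration only.
HOSTS = [
    {"id": 1, "Breakfast": 1, "Internet": 1, "Pool": 0},
    {"id": 2, "Breakfast": 0, "Internet": 1, "Pool": 1},
    {"id": 3, "Breakfast": 0, "Internet": 1, "Pool": 0},
    {"id": 4, "Breakfast": 0, "Internet": 0, "Pool": 0},
]
```

This sketch performs a quadratic number of dominance tests, which is the cost the tree-based and threshold-style algorithms discussed later aim to reduce.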

Sorted Lists
Sorted lists are popular data structures widely used by many access-based techniques in data management [7,8]. Let L = {L1, L2, . . ., Lm} be m sorted lists, where Li corresponds to a (descending) sorted list for attribute Ai. All these lists have the same length, n (i.e., one entry for each tuple in the relation). Each entry of Li is a pair of the form (tupleID, t[Ai]).
A sorted list supports two modes of access: (i) sorted (or sequential) access, and (ii) random access. Each call to sorted access returns an entry with the next highest attribute value. Performing sorted access k times on list Li will return the first k entries in the list. In random access mode, we can retrieve the attribute value of a specific tuple. A random access on list Li takes the tupleID of a tuple t as input and returns the corresponding attribute value t[Ai].
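The two access modes can be sketched as a small in-memory class. This is a minimal illustration under the assumption that the whole list fits in memory; the class name `SortedList` and its method names are hypothetical and not taken from the paper.

```python
from typing import Dict, Optional, Tuple


class SortedList:
    """In-memory sketch of one sorted list L_i over attribute `attr`."""

    def __init__(self, rows: Dict[str, Dict[str, int]], attr: str):
        # tupleID -> value, used to answer random accesses.
        self._values = {tid: row[attr] for tid, row in rows.items()}
        # (tupleID, value) pairs in descending order of value.
        self._entries = sorted(self._values.items(), key=lambda e: e[1], reverse=True)
        self._cursor = 0

    def sorted_access(self) -> Optional[Tuple[str, int]]:
        """Return the entry with the next highest value, or None when exhausted."""
        if self._cursor >= len(self._entries):
            return None
        entry = self._entries[self._cursor]
        self._cursor += 1
        return entry

    def random_access(self, tuple_id: str) -> int:
        """Return t[A_i] for the tuple with the given tupleID."""
        return self._values[tuple_id]
```

Calling `sorted_access` k times returns the first k entries of the list, matching the description above; how ties between equal values are ordered is left to the sort and matters for TA-SKY, as discussed in §4.2.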

Problem Definition
In this paper, we address the efficient computation of subspace skyline queries over a relation with categorical attributes. Formally:
Subspace Skyline Discovery: Given a relation D with the set of categorical attributes A and a subset of attributes in the form of a subspace skyline query Q ⊆ A, find the skyline over Q, denoted by SQ.
In answering subspace skyline queries we consider two scenarios: (i) no precomputed indices are available, and (ii) precomputed sorted lists exist.
Table 2 lists all the notations that are used throughout the paper (we shall introduce some of these later in the paper).
Table 2 (recoverable entries): cvij: attribute value returned by the i-th sorted access on list Lj; T: tree for storing the candidate skyline tuples; pi: the probability that the binary attribute Ai is 1.

SKYLINE COMPUTATION OVER CATEGORICAL ATTRIBUTES
Without loss of generality, for ease of explanation, we consider a relation with Boolean attributes, i.e., categorical attributes with domain size 2. We shall discuss the extensions of the algorithms for categorical attributes with larger domains later in this section.
Throughout this section, we consider the case in which precomputed indices are not available. First, we exploit the categorical characteristics of attributes by designing a tree data structure that can perform efficient dominance operations. Specifically, given a new tuple t, the tree supports three primitive operations: (i) INSERT(t): inserts a new tuple t into the tree, (ii) IS-DOMINATED(t): checks if tuple t is dominated by any tuple in the tree, and (iii) PRUNE-DOMINATED-TUPLES(t): deletes the tuples dominated by t from the tree. In Appendix A, we further improve the performance of these basic operations by proposing several optimization techniques. Finally, we propose two algorithms, ST-S (Skyline using Tree Sorting-based) and ST-P (Skyline using Tree Partition-based), that incorporate the tree structure into state-of-the-art sorting- and partition-based algorithms.

Organizing Tuples Tree
Tree structure: We use a binary tree to store tuples in the candidate skyline set. Consider an ordering of all attributes in Q ⊆ A, e.g., [A1, A2, . . ., Am'], where m' = |Q|. In addition to tuple attributes, we enhance each tuple with a score, assessed using a function F(•). This score assists in improving performance during identification of the dominated tuples or while conducting the dominance check. The proposed algorithm is agnostic to the choice of F(•); the only requirement is that the function does not assign a higher score to a dominated tuple compared to its dominator. The structure of the tree for Example 1 is depicted in Figure 1. The tree has a total of 5 (= m' + 1) levels, where the i-th level (1 ≤ i ≤ m') represents attribute Ai. The left (resp. right) edge of each internal node represents value 0 (resp. 1). Each path from the root to a leaf represents a specific assignment of attribute values. The leaf nodes of the tree store two pieces of information: (i) score: the score of the tuple mapped to that node, and (ii) tupleID list: the list of ids of the tuples mapped to that node. Note that all the tuples that are mapped to the same leaf node in the tree have the same attribute value assignment and therefore the same score. Moreover, the attribute values of a tuple t can be identified by inspecting the path from the root to a leaf node containing t. Thus, there is no requirement to store the attribute values of the tuples in the leaf nodes. Only the leaf nodes that correspond to an actual tuple are present in the tree.
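The tree layout and the INSERT operation described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the `Node` class and helper names are hypothetical, and since Equation 1 is not reproduced here, the score is assumed to be the sum of the attribute values, which satisfies the stated requirement that a dominated tuple never scores higher than its dominator.

```python
from typing import List, Optional, Sequence


class Node:
    """Tree node; only leaves carry a score and a tupleID list."""

    def __init__(self) -> None:
        # children[0] is the left edge (value 0), children[1] the right edge (value 1).
        self.children: List[Optional["Node"]] = [None, None]
        self.score: Optional[int] = None
        self.tuple_ids: List[str] = []


def score(values: Sequence[int]) -> int:
    """Assumed scoring function F: the sum of the Boolean attribute values."""
    return sum(values)


def insert(root: Node, tuple_id: str, values: Sequence[int]) -> None:
    """Follow (and create if needed) the single root-to-leaf path for `values`."""
    node = root
    for v in values:
        if node.children[v] is None:
            node.children[v] = Node()
        node = node.children[v]
    node.score = score(values)
    node.tuple_ids.append(tuple_id)


def find_leaf(root: Node, values: Sequence[int]) -> Optional[Node]:
    """Return the leaf for an attribute value assignment, or None if absent."""
    node: Optional[Node] = root
    for v in values:
        if node is None:
            return None
        node = node.children[v]
    return node
```

Tuples with identical projections share one leaf, so the attribute values never need to be stored explicitly: they are implied by the path.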
Example 1. As a running example throughout this section, consider the relation D with n = 5 non-dominated tuples, whose projection on Q = {A1, A2, A3, A4} is depicted in Table 3. The last column of the table presents the score of each tuple, utilizing the function F(•) provided in Equation 1.
In that case, the current node is also deleted from the tree. Figure 2 demonstrates the pruning algorithm for t = ⟨1, 0, 1, 1⟩. Tuples in the tree that are dominated by t are: t2, t4, and t5. The bold edges represent paths followed by the pruning algorithm. Both the left and right children of node a are visited since t[A1] = 1, whereas at nodes f and b only the left subtree is selected for searching. The final structure of the tree after deleting the dominated tuples is shown in Figure 3.
Algorithm 2 PRUNE-DOMINATED-TUPLES
Delete n from tree
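The pruning traversal can be sketched as below, following the rule illustrated by Figure 2: where t has value 1 both children are visited, and where t has value 0 only the left child can contain dominated tuples. The sketch is an assumption-laden illustration rather than the paper's Algorithm 2: node and function names are hypothetical, levels are 0-indexed, the score is assumed to be the sum of attribute values, and the sample tuples are made up.

```python
from typing import Dict, List, Optional, Sequence


class Node:
    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}  # edge value (0 or 1) -> child
        self.score: Optional[int] = None       # set on leaves only
        self.tuple_ids: List[str] = []


def insert(root: Node, tuple_id: str, values: Sequence[int]) -> None:
    node = root
    for v in values:
        node = node.children.setdefault(v, Node())
    node.score = sum(values)  # assumed scoring function
    node.tuple_ids.append(tuple_id)


def prune_dominated(node: Node, t: Sequence[int], level: int, t_score: int) -> bool:
    """Delete leaves dominated by t below `node`; return True if `node` should be removed."""
    if level == len(t):
        # Every leaf reached is no better than t on all attributes;
        # a different score means it differs from t and is therefore dominated.
        return node.score != t_score
    branches = (0, 1) if t[level] == 1 else (0,)
    for v in branches:
        child = node.children.get(v)
        if child is not None and prune_dominated(child, t, level + 1, t_score):
            del node.children[v]
    # An internal node left without children is deleted by its parent.
    return not node.children


def collect_ids(node: Node) -> List[str]:
    ids = list(node.tuple_ids)
    for child in node.children.values():
        ids.extend(collect_ids(child))
    return ids
```

A leaf holding the same assignment as t is kept, since equal tuples do not dominate each other.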

IS-DOMINATED(t):
The algorithm starts traversing the tree from the root. For each node visited by the algorithm at level i (1 ≤ i ≤ m'), we check the corresponding attribute value t[Ai]. If t[Ai] = 0, we search both the left and right subtrees; otherwise, we only need to search the right subtree. This is because when t[Ai] = 0, the tuples dominating t can be either 0 or 1 on attribute Ai, whereas when t[Ai] = 1 they must also be 1 on Ai. If we reach a leaf node whose attribute value assignment differs from that of t (i.e., score ≠ score(t)), t is dominated. Note that when t[Ai] = 0, both the left and right subtrees of the current node can have tuples dominating t, while the cost of identifying a dominating tuple (i.e., the number of nodes visited) may vary depending on whether the left or right subtree is visited first. For simplicity, we always search the right subtree first. If there exists a tuple in the right subtree of a node that dominates tuple t, we do not need to search the left subtree anymore. Figure 4 presents the nodes visited by the algorithm in order to check if the new tuple t = ⟨0, 0, 1, 0⟩ is dominated. We start from the root node a and check the value of t on attribute A1. Since t[A1] = 0, we first search the right subtree of a. After reaching node d, the algorithm backtracks to b (the parent of d). This is because t[A3] = 1 and d has no actual tuple mapped under its right child. Since t[A2] = 0 and we could not identify any dominating tuple in the right subtree of b, the algorithm starts searching the left subtree and moves to node c. At node c, only the right child is selected, since t[A3] = 1. Applying the same approach at node f, we reach the leaf node e that contains the tupleID t5. Since the value of the score variable at leaf node e is different from score(t), we conclude that the tuples mapped into e (i.e., t5) dominate t.
return IS-DOMINATED(t, n.left, l + 1, s)
10: else:
11: return IS-DOMINATED(t, n.right, l + 1, s)
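The dominance check can be sketched in the same style. This is an illustrative sketch under stated assumptions, not a reproduction of the paper's pseudocode: names are hypothetical, levels are 0-indexed, the score is assumed to be the sum of attribute values, and the sample tuples are made up rather than taken from Table 3.

```python
from typing import Dict, List, Optional, Sequence


class Node:
    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}  # edge value (0 or 1) -> child
        self.score: Optional[int] = None       # set on leaves only
        self.tuple_ids: List[str] = []


def insert(root: Node, tuple_id: str, values: Sequence[int]) -> None:
    node = root
    for v in values:
        node = node.children.setdefault(v, Node())
    node.score = sum(values)  # assumed scoring function
    node.tuple_ids.append(tuple_id)


def is_dominated(node: Node, t: Sequence[int], level: int, t_score: int) -> bool:
    """Return True if some tuple stored below `node` dominates t."""
    if level == len(t):
        # Every leaf reached is at least as good as t on all attributes;
        # a different score means it is strictly better somewhere.
        return node.score != t_score
    # Right subtree first; the left subtree is eligible only when t has value 0.
    branches = (1, 0) if t[level] == 0 else (1,)
    for v in branches:
        child = node.children.get(v)
        if child is not None and is_dominated(child, t, level + 1, t_score):
            return True  # stop at the first dominating tuple
    return False
```

Because the search returns as soon as one dominator is found, the number of visited nodes depends on which subtree is explored first, as noted above.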

Skyline using Tree
Existing works on skyline computation mainly focus on two optimization criteria: reducing the number of dominance checks (CPU cost) and limiting the communication cost with the backend database (I/O cost). Sorting-based algorithms reduce the number of dominance checks by ensuring that only the skyline tuples are inserted in the candidate skyline list, whereas partition-based algorithms achieve this by skipping dominance tests among tuples inside incomparable regions generated from the partition. However, given a list of tuples T and a new tuple t, in order to discard tuples from T that are dominated by t, both the sorting- and partition-based algorithms need to compare t against all the tuples in T. This is also the case when we need to check whether t is dominated by T. The tree structure defined in §3.1 allows us to perform these operations efficiently for categorical attributes. Since the performance gain achieved by the tree structure is independent of the optimization approaches of previous algorithms, it is possible to combine the tree structure with existing skyline algorithms. We now present two algorithms, ST-S (Skyline using Tree Sorting-based) and ST-P (Skyline using Tree Partition-based), that incorporate the tree structure into existing algorithms.
ST-S: ST-S combines the tree structure with a sorting-based algorithm. Specifically, we have selected the SaLSa [3] algorithm, which exhibits better performance compared to other sorting-based algorithms. The final algorithm is presented in Algorithm 4.
The tuples are first sorted according to the "maximum coordinate", maxC, criterion (assuming larger values are preferred for each attribute). Specifically, given a skyline query Q, maxC(tQ) = (max A∈Q {t[A]}, sum(tQ)), where sum(tQ) = Σ A∈Q t[A]. A tree structure T is used to store the skyline tuples. Note that the monotonic property of the scoring function maxC(•) ensures that all the tuples inserted in T are skyline tuples. The algorithm then iterates over the sorted list one by one, and for each new tuple t, if t is not dominated by any tuple in tree T, it is inserted in the tree (lines 7-8). For each new skyline tuple, the "stop point" tstop is updated if required (lines 10-12). The algorithm stops if all the tuples are accessed or tstop dominates the remaining tuples. A detailed description of the "stop point" can be found in the original SaLSa paper [3].
Algorithm 4 ST-S
1: Input: Tuple list T, Query Q and Tree T; Output: SQ
2: Sort tuples in D using a monotonic function maxC(•)
Output tQ as skyline tuple.
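The sort-then-filter idea behind ST-S can be sketched as follows. This is a deliberately simplified illustration, not Algorithm 4: the candidate set is a plain list instead of the tree of §3.1, the SaLSa stop point is omitted, and all names and data are hypothetical. It only shows why sorting by maxC in descending order guarantees that every tuple kept is a skyline tuple.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, int]


def max_c(t: Row, Q: Sequence[str]) -> Tuple[int, int]:
    """maxC sort key: (maximum coordinate, sum of coordinates) over Q."""
    vals = [t[a] for a in Q]
    return (max(vals), sum(vals))


def dominates(p: Row, q: Row, Q: Sequence[str]) -> bool:
    return all(p[a] >= q[a] for a in Q) and any(p[a] > q[a] for a in Q)


def sorted_skyline(D: List[Row], Q: Sequence[str]) -> List[Row]:
    """Visit tuples in descending maxC order; a dominator always precedes
    the tuples it dominates, so accepted tuples are never removed later."""
    skyline: List[Row] = []
    for t in sorted(D, key=lambda r: max_c(r, Q), reverse=True):
        if not any(dominates(s, t, Q) for s in skyline):
            skyline.append(t)
    return skyline
```

If t' dominates t, then max(t') ≥ max(t) and sum(t') > sum(t), so t' sorts strictly before t; this is the monotonicity property the text relies on. Replacing the inner `any(...)` scan with a tree-based dominance check is where ST-S would obtain its savings.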

ST-P:
We have selected the state-of-the-art partition-based algorithm BSkyTree [16] for designing ST-P. The final algorithm is presented in Algorithm 5. Given a tuple list T, the SELECT-PIVOT-POINT method returns a pivot tuple pV such that it belongs to the skyline of Q (i.e., SQ). Moreover, pV partitions the tuples in T in a way such that the number of dominance tests is minimized (details in [16]). Tuples in T are then split into 2^|Q| lists, each corresponding to one of the 2^|Q| regions generated by pV (lines 7-9). Tuples in L[0] are dominated by pV, hence can be pruned safely. For each pair of lists
Bi ← |Q|-bit binary vector corresponds to t wrt pV
9: for ∀j ∈ [max, i) : Bj Bi
14: for ∀t ∈ L[j]: PRUNE-DOMINATED-TUPLES(tQ, T.rootNode, 1, score(tQ))
15: SQ ← SQ ∪ ST-P(tuples in T)
16: return SQ
Performance Analysis: We now provide a theoretical analysis of the performance of the primitive operations utilized by ST-S and ST-P. To make the theoretical analysis tractable, we assume that the underlying data is i.i.d., where pi is the probability of having value 1 on attribute Ai.
The cost of the INSERT-TUPLE(tQ) operation is O(m'), since to insert a new tuple in the tree one only needs to follow a single path from the root to a leaf. For IS-DOMINATED(tQ) and PRUNE-DOMINATED-TUPLES(tQ), we utilize the number of nodes visited in the tree as the performance measure of these operations.
Consider a tree T with s tuples; let Cost(l, s) be the expected number of nodes visited by the primitive operations.
Theorem 1. Considering a relation with n binary attributes where pi is the probability that a tuple has value 1 on attribute Ai, the expected cost of the IS-DOMINATED(tQ) operation on a tree T containing s tuples is: where S(l, Please refer to Appendix D for the proof.
Theorem 2. Given a Boolean relation D with n tuples and the probability of having value 1 on attribute Ai being pi, the expected cost of the PRUNE-DOMINATED-TUPLES(tQ) operation on a tree T containing s tuples is The proof is available in Appendix D.
Figure 5 uses Equations 2 and 3 to provide an expected cost for the IS-DOMINATED and PRUNE-DOMINATED-TUPLES operations, for varying numbers of tuples in T (s), where m' = 20. We compare its performance with the approach in which candidate skyline tuples are organized in a list. Suppose there are s tuples in the list; the best case for the domination test occurs when the first tuple in the list dominates the input tuple (O(1 × m')), while in the worst case, none or only the very last tuple dominates it (O(s × m')) [4]. Thus, on average the dominance test iterates over half of its candidate list (i.e., (s/2) × m' comparisons). On the other hand, in order to prune tuples in the list that are dominated by tQ, existing algorithms need to compare tQ with all the entries in the list. Hence, the expected cost of PRUNE-DOMINATED-TUPLES is s × m'. From the figure, we can see that the expected number of comparisons required by the two primitive operations is significantly lower when tuples are organized in a tree instead of a list. Moreover, as pi increases, the cost of the primitive operations decreases. This is because, when the value of pi is large, the
The above simulations show that the tree structure can reduce the cost of dominance tests effectively, thus improving the overall performance of the ST algorithms. Although the analysis has been carried out for i.i.d. data, our experimental results in §6 show similar behavior for other types of datasets.

Extension for Categorical Attributes
We now discuss how to modify the ST algorithms for relations having categorical attributes. We need to make the following two changes:
• The tree structure designed in §3.1 needs to be modified for categorical attributes.
• We also need to change the tree traversal algorithms used in each of the three primitive operations.
Tree structure: The tree structure will not be binary anymore.

SUBSPACE SKYLINE USING SORTED LISTS
In this section, we consider the availability of sorted lists L1, L2, . . ., Lm, as per §2, and utilize them to design efficient algorithms for subspace skyline discovery. We first briefly discuss a baseline approach that is an extension of LS [19]. Then in §4.1, we overcome the barriers of the baseline approach by proposing an algorithm named TOP-DOWN. The algorithm applies a top-down on-the-fly parsing of the subspace lattice and prunes the dominated branches. However, the expected cost of TOP-DOWN depends exponentially on the value of m' (Appendix C). We then propose TA-SKY (Threshold Algorithm for Skyline) in §4.2, which does not have such a dependency. In addition to the sorted lists, TA-SKY also utilizes the ST algorithm proposed in §3 for computing skylines.
After computing the projections of all tuples in query space, we create a lattice over Q and run the LS algorithm to discover the subspace skyline.
We identify the following problems with BASELINE:
• It makes two passes over all the tuples in the relation.
• It requires the construction of the complete lattice of size |Dom(Q)|. For example, when |Dom(Ai)| = 4 and m' = 15, the lattice has more than one billion nodes; yet the algorithm needs to map the tuples into the lattice.
One observation is that for relations with categorical attributes, especially when m' is relatively small, skyline tuples are more likely to be discovered at the upper levels of the lattice. This motivated us to seek alternate approaches. Unlike BASELINE, TOP-DOWN and the TA-SKY algorithm are designed in a way that they are capable of answering subspace skyline queries by traversing a small portion of the lattice and, more importantly, without the need to access the entire relation.
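The lattice size quoted in the example can be checked directly, assuming the query involves 15 attributes and every attribute has a domain of size 4:

```latex
|Dom(Q)| \;=\; \prod_{i=1}^{15} |Dom(A_i)| \;=\; 4^{15} \;=\; 2^{30} \;=\; 1{,}073{,}741{,}824 \;>\; 10^{9}
```

Each additional attribute of domain size 4 multiplies the lattice size by 4, which is why lattice-based approaches are feasible only for very short queries.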

TOP-DOWN
Key Idea: Given a subspace skyline query Q, we create a lattice capturing the dominance relationships among the tuples in DQ. Each node in the lattice represents a specific attribute value combination in query space and hence corresponds to a potential tuple in DQ. For a given lattice node u, if there exist tuples in DQ with the same attribute value combination as u, then all tuples in DQ corresponding to nodes dominated by u in the lattice are also dominated. TOP-DOWN utilizes this observation to compute skylines for a given subspace skyline query. Instead of iterating over the tuples, TOP-DOWN traverses the lattice nodes from top to bottom; it utilizes sorted lists LQ to search for tuples with specific attribute value combinations. When |Q| is relatively small, it is likely that one will discover all the skyline tuples just by checking a few attribute value combinations, without considering the rest of the lattice. However, the expected cost of TOP-DOWN increases exponentially as we increase the query length. Please refer to Appendix C for the details and the limitations of TOP-DOWN.

TA-SKY
We now propose our second algorithm, Threshold Algorithm for Skyline (TA-SKY), in order to answer subspace skyline queries. Unlike TOP-DOWN, whose cost depends exponentially on m', TA-SKY has a worst case time complexity of O(m' n^2), as we shall show in §4.2.1; in addition, we shall also study the expected cost of TA-SKY. The main innovation in TA-SKY is that it follows the style of the well-known Threshold Algorithm (TA) [8] for top-k query processing, except that it is used for solving a skyline problem rather than a top-k problem.
TA-SKY iterates over the sorted lists LQ until a stopping condition is satisfied. At each iteration, we perform m' parallel sorted accesses, one for each sorted list in LQ. Let cvij denote the current value returned from sorted access on list Lj ∈ LQ (1 ≤ j ≤ m') at iteration i. Let τi be the set of values returned at iteration i, τi = {cvi1, cvi2, . . ., cvim'}. We create a synthetic tuple tsyn as the threshold value to establish a stopping condition for TA-SKY. The attribute values of the synthetic tuple tsyn are set according to the current values returned by each sorted list. Specifically, at iteration i, tsyn[Aj] = cvij, ∀j ∈ [1, m']. In other words, tsyn corresponds to a potential tuple with the highest possible attribute values that has not been seen by TA-SKY yet.
In addition, TA-SKY also maintains a candidate skyline set. The candidate skyline set materializes the skylines among the tuples seen until the last stopping condition check. We use the tree structure described in §3.2 to organize the candidate skyline set. Note that instead of checking the stopping condition at each iteration, TA-SKY considers the stopping condition at iteration i only when cvij ≠ cv(i−1)j for at least one of the m' sequential accesses. This is because the stopping condition does not change among iterations that have the same τ value. Let us assume the value of τ changes at the current iteration i and the stopping condition was last checked at iteration i' (i' < i). Let T be the set of tuples that are returned in at least one of the sequential accesses between iterations i' and i. For each tuple t ∈ T, we perform random access in order to retrieve the values of missing attributes (i.e., attributes of tQ for which we do not know the values yet). Once the tuples in T are fully constructed, TA-SKY compares them against the tuples in the candidate skyline set. For each tuple t ∈ T, three scenarios can arise:
1. t dominates a tuple t' in the tree (i.e., the candidate skyline set): t' is deleted from the tree.
2. t is dominated by a tuple t' in the tree: t is discarded since it cannot be a skyline tuple.
3. t is not dominated by any tuple t' in the tree: t is inserted in the tree.
Once the candidate skyline set is updated with the tuples in T, we compare tsyn with the tuples in the candidate skyline set. The algorithm stops when tsyn is dominated by any tuple in the candidate skyline set.
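The procedure described above can be sketched compactly. This is a minimal illustration under several assumptions, not the paper's implementation: the sorted lists are built in memory on the fly instead of being precomputed, random access is simulated by a dictionary lookup, the candidate skyline set is a plain dictionary rather than the tree of §3, larger values are preferred, and all names and data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Values = Tuple[int, ...]


def dominates(p: Values, q: Values) -> bool:
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def update(candidates: Dict[str, Values], tid: str, values: Values) -> None:
    """Apply the three scenarios to one fully constructed tuple."""
    if any(dominates(v, values) for v in candidates.values()):
        return  # scenario 2: dominated, discard
    for other in [o for o, v in candidates.items() if dominates(values, v)]:
        del candidates[other]  # scenario 1: remove dominated candidates
    candidates[tid] = values  # scenario 3 (and the tail of scenario 1): insert


def ta_sky(rows: Dict[str, Dict[str, int]], Q: Sequence[str]) -> Tuple[List[str], int]:
    """Return (sorted skyline tupleIDs, number of distinct tuples accessed)."""
    lists = [sorted(((tid, r[a]) for tid, r in rows.items()),
                    key=lambda e: e[1], reverse=True) for a in Q]
    candidates: Dict[str, Values] = {}
    pending: List[str] = []  # tuples seen since the last stopping-condition check
    seen = set()
    prev_tau = None
    for i in range(len(rows)):
        tau = tuple(lst[i][1] for lst in lists)  # values of the synthetic tuple
        for lst in lists:
            tid = lst[i][0]
            if tid not in seen:
                seen.add(tid)
                pending.append(tid)
        if prev_tau is not None and tau != prev_tau:
            for tid in pending:
                update(candidates, tid, tuple(rows[tid][a] for a in Q))
            pending.clear()
            if any(dominates(v, tau) for v in candidates.values()):
                return sorted(candidates), len(seen)  # early stop
        prev_tau = tau
    for tid in pending:  # lists exhausted without an early stop
        update(candidates, tid, tuple(rows[tid][a] for a in Q))
    return sorted(candidates), len(seen)
```

The early stop is safe because every unseen tuple is no better than the synthetic tuple on each attribute, so any candidate that dominates the synthetic tuple also dominates every unseen tuple.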
We shall now explain TA-SKY for the subspace skyline query Q of Example 2. Sorted lists LQ corresponding to query Q are shown in Figure 6.At iteration 1, TA-SKY retrieves the tuples t1, t2 and t5 by sequential access.For t1 we know its value on attributes A2 and A4 whereas for t2 and t5 we know their value on A3 and A1 respectively.At this position we have T = {t1, t2, t5} and τ1 = {1, 1, 1, 1}.Note that in addition to storing the tupleIDs that we have seen so far, we also keep track of the attribute values that are known from sequential access.After iteration 2, T = {t1, t2, t3, t5, t6} and τ2 = {1, 1, 1, 1}.At iteration 3 we retrieve the values of t1, t2, t5 and t4 on attributes A1, A2, A3, and A4 respectively and update the corresponding entries T .Since τ3 = {0, 0, 1, 1} is different from τ2, TA-SKY checks the stopping condition.First, we get the missing attribute values (attribute values which are not known from sequential access) of each tuple t ∈ T .This is done performing random access on the appropriate sorted list in LQ.After all the tuples in T are fully constructed, we update the candidate skyline set using them.The final candidate skyline set is constructed after considering all the tuples in T is {t1, t5, t6}.Since the synthetic tuple tsyn = 0, 0, 1, 1 corresponds to τ3 is dominated by the candidate skyline set, we stop scanning the sorted lists and output the tuples in the candidate skyline set as the skyline answer set.
The number of tuples inserted into T (i.e., partially retrieved by sequential accesses) before the stopping condition is satisfied impacts the performance of TA-SKY. This is because for each tuple t ∈ T, we first have to perform random accesses to get the missing attribute values of t and then compare t with the tuples in the candidate skyline set to check whether t is a skyline tuple. Both the number of random accesses and the number of dominance tests increase the execution time of TA-SKY. Hence, it is desirable to have a small number of entries in T. We noticed that the number of tuples inserted in T by TA-SKY depends on the organization of (tupleID, value) pairs (i.e., the ordering of pairs having the same value) in the sorted lists. Figure 7 displays sorted lists L′Q for the same relation as in Example 2 but with a different organization. With both LQ and L′Q, TA-SKY stops at iteration 3. However, for LQ, after iteration 3, T = {t1, t2, t3, t4, t5, t6} and we need a total of 12 random accesses and 12 dominance tests (see footnote 4). On the other hand, with L′Q, after iteration 3 we have T = {t1, t2, t5, t6}, requiring only 4 random accesses and 8 dominance tests.
One possible approach to improve the performance of TA-SKY is to reorganize the sorted lists before running the algorithm for a given subspace skyline query. Specifically, for all t, t′ ∈ D such that t[Ai] = t′[Ai], position t before t′ in the sorted list Li (1 ≤ i ≤ m′) if t has better values than t′ on the remaining attributes. However, rearranging the sorted lists for each subspace skyline query would be costly.
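A possible tie-breaking rule for this reorganization is sketched below. The paper does not specify how "better on the remaining attributes" is measured; the sum of the remaining query attributes used here is an assumption made for illustration, and build_sorted_list is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple

Values = Tuple[int, ...]


def build_sorted_list(
    relation: Dict[str, Values], j: int, query: Sequence[int]
) -> List[Tuple[str, int]]:
    """Sorted list for attribute j, best value first.

    Ties on attribute j are broken by the sum of the other query attributes
    (an assumed proxy for "better on the remaining attributes").
    """
    def key(tid: str) -> Tuple[int, int]:
        t = relation[tid]
        rest = sum(t[a] for a in query if a != j)
        return (-t[j], -rest)

    return [(tid, relation[tid][j]) for tid in sorted(relation, key=key)]


# Hypothetical relation: "a" and "b" tie on attribute 0, "b" is better elsewhere.
relation = {"a": (1, 0, 0), "b": (1, 1, 1), "c": (0, 1, 1)}
l0 = build_sorted_list(relation, j=0, query=(0, 1, 2))
```

Because the tie-break depends on the query attributes, the lists would have to be rebuilt per query, which is the cost the text points out.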
We now propose several optimization techniques that enable TA-SKY to compute skylines without considering all the entries in T.
Selecting appropriate entries in T: Our goal is to perform random accesses and dominance checks only for tuples in T that are likely to be skyline tuples for a given subspace skyline query. Consider a scenario where TA-SKY needs to check the stopping condition at iteration k, i.e., τk ≠ τ(k−1). Let Q′ be the set of attributes for which the value returned by sequential access at iteration k differs from that of the (k − 1)-th iteration, Q′ = {Ai | Ai ∈ Q, cv_ki < cv_(k−1)i}. In order for the tuple tsyn to be dominated, there must exist a tuple t ∈ T that has a larger value than tsyn on at least one attribute in Q′. This is because for all Ai ∈ Q \ Q′, sorted access returns the same value on both the (k − 1)-th and k-th iterations (i.e., cv_(k−1)i = cv_ki). Hence, the only way a tuple t ∈ T can dominate tsyn is to have a larger value on some attribute in Q′. Therefore, we only need to consider a subset of tuples T′ = {t | t ∈ T, ∃Ai ∈ Q′ s.t. t[Ai] = cv_(k−1)i}. Note that it is still possible that ∃t, t′ ∈ T′ s.t. t dominates t′ on Q. Thus, we only need to consider the tuples that are skylines among T′ and the candidate skyline set. To summarize, before checking the stopping condition at iteration k, we have to perform the following operations: (i) select a subset of tuples T′ from T that are likely to dominate tsyn, (ii) for each tuple t ∈ T′, get the missing attribute values of t by performing random accesses on the appropriate sorted lists, (iii) update the candidate skyline set using the skylines in T′, and (iv) check if tsyn is dominated by the updated candidate skyline set.
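Step (i) can be sketched as follows, under the reading that a tuple is selected when it holds the previous sorted-access value on some attribute whose value dropped at the current iteration. The representation of T as a dictionary of partially known attribute values and the name select_subset are assumptions made for this sketch.

```python
from typing import Dict, Sequence

Partial = Dict[int, int]  # attribute index -> value known from sequential access


def select_subset(
    T: Dict[str, Partial], cv_prev: Sequence[int], cv_curr: Sequence[int]
) -> Dict[str, Partial]:
    """Keep tuples that hold the previous value on some attribute whose value dropped."""
    # Attributes whose sorted-access value decreased between the two iterations.
    changed = [i for i, (p, c) in enumerate(zip(cv_prev, cv_curr)) if c < p]
    return {
        tid: known
        for tid, known in T.items()
        if any(known.get(i) == cv_prev[i] for i in changed)
    }


# Hypothetical partial entries; indices 0..3 stand for A1..A4.
T = {"t1": {1: 1, 3: 1}, "t2": {2: 1}, "t5": {0: 1}, "t4": {3: 1}}
T_sub = select_subset(T, cv_prev=(1, 1, 1, 1), cv_curr=(0, 0, 1, 1))
```

Only the selected tuples then need random accesses and dominance tests; the remaining entries stay in T for later checks.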
Note that in addition to reducing the number of random accesses and dominance tests, the above optimization makes the TA-SKY algorithm progressive, i.e., tuples that are inserted into the candidate skyline set will always be skyline tuples in the query space Q. This characteristic makes TA-SKY suitable for real-world web applications where, instead of waiting for all the results to be returned, users want a subset of the results very quickly.
Utilizing the ST algorithms: We can utilize the ST algorithms for discovering the skyline tuples from T′. This way we can take advantage of the optimization approaches proposed in §3. For example, we can call the ST-S algorithm with two parameters: the tree T (which stores all the skyline tuples discovered so far) and the tuple list T′. The output is the set of skyline tuples in T′ that are not dominated by any tuple in the tree. Moreover, after sorting the tuples in ST-S, if we identify that score(ti) = score(ti−1) (2 ≤ i ≤ |T′|) and ti−1 is dominated, we can safely mark ti as dominated. This is because score(ti) = score(ti−1) implies that ti and ti−1 have the same attribute value assignment. When the number of attributes in a subspace skyline query is small, this approach allows us to skip a large number of dominance tests.
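The duplicate-score shortcut can be sketched as below. This is not ST-S itself: it scans a plain list instead of the tree, and it assumes Boolean attributes with score(t) read as the binary number formed by the attribute values, so that equal scores imply identical tuples and a dominating tuple always has a strictly larger score.

```python
from typing import List, Sequence, Tuple

Values = Tuple[int, ...]


def dominates(p: Values, q: Values) -> bool:
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def score(t: Values) -> int:
    """Assumed score: the Boolean tuple read as a binary number (monotone)."""
    s = 0
    for v in t:
        s = 2 * s + v
    return s


def skyline_with_duplicate_skip(tuples: Sequence[Values]) -> Tuple[List[Values], int]:
    """Return the skyline and the number of dominance tests actually performed."""
    ordered = sorted(tuples, key=score, reverse=True)
    skyline: List[Values] = []
    tests = 0
    prev_score, prev_dominated = None, False
    for t in ordered:
        s = score(t)
        if s == prev_score and prev_dominated:
            dominated = True  # same assignment as a dominated tuple: no test needed
        else:
            dominated = False
            for c in skyline:
                tests += 1
                if dominates(c, t):
                    dominated = True
                    break
        if not dominated:
            skyline.append(t)
        prev_score, prev_dominated = s, dominated
    return skyline, tests


# Hypothetical input with three copies of a dominated tuple.
result, tests = skyline_with_duplicate_skip(
    [(1, 1, 0), (0, 1, 0), (0, 1, 0), (0, 1, 0), (1, 0, 1)]
)
```

Here the second and third copies of ⟨0, 1, 0⟩ are marked dominated without any dominance test, which is where the savings come from when few attributes produce many duplicates.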
The pseudocode of TA-SKY, after applying the optimizations above, is presented in Algorithm 6.
Algorithm 6 TA-SKY (excerpt)
    ...
    for each sorted list Li ∈ LQ
        Ai = attribute corresponding to Li
        (tupleID, value) = SortedAccess(Li)
        T[tupleID][Ai] = value
        τ[Ai] = value
    if τ remains unchanged from the previous iteration:
        continue
    ...
    Delete entries from T that are inserted in T′
    for each t ∈ T′
        for ...
            ...
        Update score of t
    ST-S(T′, Q, T)
    tsyn = synthetic tuple with values of τ
    until IS-DOMINATED(tsyn, T.root, 1, score(tsyn))

Performance Analysis
Worst Case Analysis: In the worst case, TA-SKY will exhaust all the m′ sorted lists. Hence, it will perform O(m′n) sorted and O(m′n) random accesses. After all the tuples are fully constructed, for each tuple t, we need to check whether any other tuple in T dominates t. The cost of each dominance check operation is O(m′n). Hence, the cost of n dominance checks is O(m′n²). Therefore, the worst case time complexity of TA-SKY is O(m′n²).
Expected Cost Analysis: Lemma 1. Considering pi as the probability that a tuple has value 1 on the binary attribute Ai, the expected number of tuples discovered by TA-SKY after i iterations is as given in Equation 4, where Pseen(t, i) is computed using Equation 5.
Refer to Appendix D for the proof.
Theorem 3. Given a subspace skyline query Q, the expected number of sorted accesses performed by TA-SKY on an n-tuple Boolean relation, with pj being the probability of having value 1 on attribute Aj, is expressed in terms of Pstop(i), which is computed using Equations 7, 8, and 9.
The proof is available in Appendix D.

RELATED WORK
In the database context, the skyline operator was first introduced in [4]. Since then, much work has aimed to improve the performance of skyline computation in different scenarios. In this paper, we consider skyline algorithms designed for centralized database systems.
To the best of our knowledge, LS [19] and Hexagon [22] are the only two algorithms designed to compute skylines over categorical attributes. Both algorithms operate by first creating the complete lattice of possible attribute-value combinations. Using the lattice structure, non-skyline tuples are then discarded. Even though LS and Hexagon can discover the skylines in linear time, the requirement to construct the entire lattice for each skyline query is strict and not scalable. The size of the lattice is exponential in the number of attributes in a skyline query. Moreover, in order to discover the skylines, the algorithms have to scan the entire dataset twice, which is not ideal for online applications.
Most of the existing work on skyline computation concerns relations with numeric attributes. Broadly speaking, skyline algorithms for numerical attributes can be categorized as follows.
Sorting-based Algorithms utilize sorting to improve the performance of skyline computation, aiming to discard non-skyline objects using a small number of dominance checks [5, 9]. For any subspace skyline query, such approaches require sorting the dataset. SaLSa [3] is the best in this category, and we demonstrated that our adaptation to categorical domains, namely ST-S, outperforms SaLSa.
Partition-based Algorithms recursively partition the dataset into a set of disjoint regions, compute local skylines for each region, and merge the results [4, 27]. Among these, BSkyTree [16] has been shown to be the best performer. We demonstrated that our adaptation of this algorithm for categorical domains, namely ST-P, outperforms the vanilla BSkyTree when applied to our application scenario. Other partitioning algorithms, such as NN [15], BBS [20], and ZSearch [17], utilize indexing structures such as the R-tree and ZB-tree for efficient region-level dominance tests. However, adapting such algorithms to the subspace skyline problem would incur exponential space overhead, which is not in line with the scope of our work (an overhead at most linear in the number of attributes).
A body of work is also devoted to Subspace Skyline Algorithms [26, 21], which utilize pre-computation to compute skylines for each subspace skyline query. These algorithms impose exponential space overhead, however. Further improvements to reduce the overhead [23, 24, 25, 18] are highly data dependent and offer no guarantees for their storage requirements.

Experimental Setup
In this section, we describe our experimental results. In addition to the theoretical analysis presented in §3 and §4, we compared our algorithms experimentally against existing state-of-the-art algorithms. Our experiments were run over synthetic data, as well as real-world data collected from AirBnB (http://www.airbnb.com/). The synthetic data was used to evaluate the effectiveness of the proposed methods over varying characteristics of the dataset.
Synthetic Datasets: In order to study the performance of the proposed algorithms in different scenarios, we generated a number of Zipfian datasets, each containing 2M tuples and 30 attributes. Specifically, we created datasets with attribute cardinality ranging from 2 to 8. In this environment, the frequency of an attribute value is inversely proportional to its rank. Therefore, the number of tuples having a higher (i.e., better) attribute value is less than the number of tuples with a comparatively lower attribute value. We used a Python package for generating these datasets. For each attribute, we specify its distribution over the corresponding domain by controlling the z value. Two attributes having the same cardinality but different z values will have different distributions. Specifically, the attribute with the lower z value will have a higher number of tuples having a higher attribute value. Unless otherwise specified, we set the z values of the attributes evenly distributed in the range (1, 2] for generating synthetic datasets.
Choice of dataset: We used Zipfian datasets as they more closely reflect the situation in real categorical datasets. Specifically, in real-world applications, for a specific attribute, the number of objects having higher (i.e., better) attribute values is likely to be less than the number of objects with lower attribute values. For example, in AirBnB, hosts with 3 bedrooms are less frequent than hosts with a single bedroom. Similarly, in Craigslist, sedans are more prevalent than sports cars. Moreover, in real-world applications, the distributions of attributes are different from one another. For example, in our AirBnB dataset, approximately 600k out of the 2M hosts have the amenity Cable TV, whereas the approximate number of hosts with the amenity Hot Tub is only 200k.
AirBnB Dataset: Probably one of the best fits for the application of this paper is AirBnB. It is a peer-to-peer location-based marketplace in which people can rent out their properties or look for an abode for a temporary stay. We collected the information of approximately 2 million real properties around the globe, shared on this website. AirBnB has a total of 41 attributes for each host that capture the features and amenities provided by the hosts. Among all the attributes, 36 are Boolean (categorical with domain size 2), such as Breakfast, Cable TV, Gym, and Internet, while 5 are categorical, such as Number of Bedrooms and Number of Beds. We tested our proposed algorithms against this dataset to see their performance on real-world applications.
Algorithms Evaluated: We tested the proposed algorithms, namely ST-S, ST-P, TOP-DOWN, and TA-SKY, as well as the state-of-the-art algorithms LS [19], SaLSa [3], and BSkyTree [16] that are applicable to our problem settings.
Performance Measures: We consider running time as the main performance measure of the algorithms proposed in this paper. In addition, we also investigate the key features of the ST-S, ST-P, and TA-SKY algorithms and demonstrate how they behave under a variety of settings. Each data point is obtained as the average of 25 runs.
Hardware and Platform: All our experiments were performed on a quad-core 3.5 GHz Intel i7 machine running Ubuntu 14.04 with 16 GB of RAM. The algorithms were implemented in Python.
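The synthetic setup can be approximated with the standard library alone. The sketch below is an assumption about how such data might be generated, not the package used in the paper: value v of an attribute (0 is worst) receives weight 1/(v + 1)^z, so that better values are rarer and a lower z flattens the distribution. The function name generate_zipfian and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def generate_zipfian(
    n: int, cardinalities: Sequence[int], zs: Sequence[float], seed: int = 0
) -> List[Tuple[int, ...]]:
    """Generate n tuples; attribute j takes values 0..cardinalities[j]-1."""
    rng = random.Random(seed)
    columns = []
    for c, z in zip(cardinalities, zs):
        # Rank 1 is the lowest (most frequent) value, rank c the best (rarest).
        weights = [1.0 / (rank ** z) for rank in range(1, c + 1)]
        columns.append(rng.choices(range(c), weights=weights, k=n))
    return list(zip(*columns))


# A small sample: one Boolean attribute with z = 1.01 and one 4-valued attribute with z = 2.
data = generate_zipfian(1000, cardinalities=[2, 4], zs=[1.01, 2.0], seed=1)
```

Each attribute is drawn independently here; any correlation between attributes in the real data is not modeled.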

Experiments over Synthetic Datasets
Effect of Query Size m′: We start by comparing the performance of our algorithms with existing state-of-the-art algorithms that exhibit the best performance in their respective domains. Note that, unlike TA-SKY, the rest of the algorithms do not leverage any indexing structure. The goal of this experiment is to demonstrate how utilizing a small amount of precomputation (compared to the inordinate amount of space required by Skycube algorithms) can improve the performance of subspace skyline computation. Moreover, the precomputation cost is independent of the skyline query, because we only need to build the sorted lists once at the beginning. For this experiment, we set n = 500k and vary m′ between 6 and 24. In order to match real-world scenarios, we selected attributes with cardinality c ranging between 2 and 6. Specifically, 50% of the selected attributes have cardinality 2, 30% have cardinality 4, and 20% have cardinality 6. Figure 8 shows the experiment result. We can see that when m′ is small, TA-SKY outperforms the other algorithms. This is because, with a small query size, TA-SKY can discover all the skylines by accessing only a small portion of the tuples in the dataset. However, as m′ increases, the likelihood of a tuple dominating another tuple decreases. Hence, the total number of tuples accessed by TA-SKY before the stopping condition is satisfied also increases, and the performance gap between TA-SKY and ST-S starts to decrease. Both ST-S and ST-P exhibit better performance compared to their baseline algorithms (SaLSa and BSkyTree). Algorithms such as ST-P, BSkyTree, and LS do not scale for larger values of m′. This is because all these algorithms operate by constructing a lattice over the query space, which grows exponentially. Moreover, even though TOP-DOWN initially performed well, it did not complete successfully for m′ > 4.
Figure 9 demonstrates the effect of m′ and z on the performance of TA-SKY and ST-S. For this experiment, we created two datasets with cardinality c = 6 and different z values. In the first dataset, all the attributes have the same z value (i.e., z = 1.01), whereas for the second dataset, the z values of the attributes are evenly distributed within the range (1, 2]. By setting z = 1.01 for all attributes, we increase the frequency of tuples having preferable (i.e., higher) attribute values. Hence, the skyline size of the first dataset is less than the skyline size of the second dataset. This is because tuples with preferable attribute values are likely to dominate more non-skyline tuples, resulting in a small skyline size. Moreover, this also increases the likelihood of the stopping condition being satisfied at an early stage of the iteration. Hence, TA-SKY needs less time for the dataset with z = 1.01. In summary, TA-SKY performs better on datasets where more tuples have preferable attribute values. The right y-axis of Figure 9 shows the skyline size for each query length. One can see that as the query size increased, the chance of tuples dominating each other decreased, which resulted in a significant increase in the skyline size. Please note that the increases in the execution time of TA-SKY are due to the increase in the skyline size, which is bounded by n. Moreover, as m′ increases, there is an initial decrease in skyline size. This is because when m′ is small (i.e., 2), the likelihood of a tuple having the highest (i.e., preferable) value on all attributes is large.
Effect of Dataset Size (n): Figure 10 shows the impact of n on the performance of TA-SKY and ST-S. For this experiment, we used a dataset with cardinality c = 6 and m′ = 12, and varied n from 500K to 2M. As we increase the value of n, the number of skyline tuples increases. With the increase in skyline size, both TA-SKY and ST-S need to process more tuples before satisfying the stopping condition. Therefore, the total execution time increases with n.
Effect of Attribute Cardinality (c): In our next experiment, we investigate how changing the attribute cardinality affects the execution time of TA-SKY and ST-S. We set the dataset size to n = 1M and the query size to m′ = 12, and vary the attribute cardinality c from 4 to 8. Figure 11 shows the experiment result. Increasing the cardinality of the attributes increases the total number of skyline tuples, and therefore affects the total execution time of TA-SKY and ST-S.
Progressive Behavior of TA-SKY: Figures 12 and 13 demonstrate the incremental performance of TA-SKY in discovering new skylines for a specific query of size m′ = 12, with n = 1M and all the attributes having cardinality c = 12. Figure 12 shows the CPU time as a function of the skyline size returned. We can see that even though the full skyline discovery takes 250 seconds, within the first 50 seconds TA-SKY outputs more than 50% of the skyline tuples. Figure 13 presents the number of tuples TA-SKY accessed as a function of the skyline tuples discovered so far. The skyline contains more than 33k tuples. In order to discover all the skylines, TA-SKY needs to access almost 700K (70%) of the tuples. However, we can see that more than 80% of the skyline tuples can be discovered by accessing less than 30% of the tuples.

Experiments over AirBnB Dataset
In this experiment, we test the performance of our final algorithm, TA-SKY, against the real AirBnB dataset. In particular, we study (i) the effects of varying m′ and n on the performance of the algorithm and (ii) its progressive behavior.
Effect of Varying Query Size (m′): In our first experiment on the AirBnB dataset, we compared the performance of the different algorithms proposed in the paper with existing works. We varied the number of attributes in the query (i.e., m′) from 2 to 24 while setting the number of tuples to 1,800,000. Figure 14 shows the experiment result. Similar to our experiment on the synthetic dataset (Figure 8), TA-SKY and ST-S perform better than the remaining algorithms. Even though it initially performed well, TOP-DOWN did not scale beyond query length 4. This is because, with m′ > 4, the skyline hosts shift to the middle of the corresponding query lattice, requiring TOP-DOWN to query many lattice nodes. Figure 15 shows the relation between the performance of TA-SKY and the skyline size. Unlike the generally accepted rule of thumb that the skyline size grows exponentially as the number of attributes increases, in this experiment the skyline size initially decreased as the query size increased and then started to increase again after query size 12. The reason is that when the query size is small and n is relatively large, the chance of having many tuples with (almost) all attributes in Q being 1 (for Boolean attributes) is high. None of these tuples is dominated, and together they form the skyline. However, as the query size increases, the likelihood of having a tuple in the dataset that corresponds to the top node of the lattice decreases. Hence, if the query size gets sufficiently large, we will not see any tuple corresponding to the top node. From then on, the skyline size increases with the query size.
Effect of Varying Dataset Size (n): In this experiment, we varied the dataset size from 500,000 to 1,800,000 tuples, while setting m′ to 20. Figure 16 shows the performance of TA-SKY and ST-S in this case. One can see that between these two algorithms, the cost of ST-S grows faster. Moreover, even though in the worst case TA-SKY is quadratically dependent on n, it performs significantly better in practice. In particular, in this experiment, a factor of 4 increase in the dataset size increased the execution time by less than a factor of 3.
Progressive Behavior of TA-SKY: As explained in §4.2, TA-SKY is a progressive algorithm, i.e., tuples that are inserted into the candidate skyline set are guaranteed to be in SQ. This characteristic of TA-SKY makes it suitable for real-world (especially web) applications, where, rather than delaying the result until the algorithm ends, partial results can gradually be returned to the user. Moreover, TA-SKY tends to discover a large portion of the skyline within a short execution time and with a small number of tuple accesses (a measure of cost in web applications).
To study this property of the algorithm, in this experiment we set n = 1,800,000 and m′ = 20 and monitored the execution time, as well as the number of tuple accesses, as new skyline tuples are discovered. Figures 17 and 18 show the experiment results for the execution time and the number of accessed tuples, respectively. One can see that TA-SKY performed well in discovering a large number of skyline tuples quickly. For example, (i) as shown in Figure 17, it discovered more than 2/3 of the skylines in less than 3 seconds, and (ii) as shown in Figure 18, more than half of the skylines were discovered by accessing less than 2% of the tuples (20,000 tuples).
... rooted at u, since it is not possible to have a tuple t′ under u that is dominated by t (due to monotonicity). Similarly, while checking if t is dominated by any other tuple in the tree, we stop traversing the subtree rooted at an internal node u if currentScore is higher than the maxScore value of u.
Figure 19 presents the values of minScore and maxScore at each internal node of the tree for the relation in Table 3. Consider a new tuple t = ⟨1, 0, 0, 0⟩. In order to prune the tuples dominated by t, we start from the root node a. At node a, currentScore = score(t) = 8. Since t[A0] = 1, we need to search both the left and right subtrees of a. The value of currentScore at node c remains unchanged, since the edge used to reach c from a matches the value of t[A0]. However, for b the value of currentScore has to be updated. The currentScore value at node b is obtained by changing the value of t[A0] to 0 (the values of the other attributes remain the same as in the parent node) and computing the score of the updated tuple. Note that the value of currentScore is less than minScore in both nodes b and c. Hence, we can be sure that no tuple in the subtrees rooted at nodes b and c can be dominated by t.
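The pruning test in this example can be sketched as follows. The score function is an assumption inferred from score(⟨1, 0, 0, 0⟩) = 8, namely the Boolean tuple read as a binary number; the minScore values of Figure 19 are not reproduced here, so the value used below is hypothetical.

```python
from typing import Tuple

Values = Tuple[int, ...]


def score(t: Values) -> int:
    """Assumed monotone score: the tuple read as a binary number (A0 most significant)."""
    s = 0
    for v in t:
        s = 2 * s + v
    return s


def can_skip_subtree(current_score: int, min_score: int) -> bool:
    """No tuple under a node can be dominated by t if even the smallest stored
    score exceeds currentScore, because dominated tuples have smaller scores."""
    return current_score < min_score


t = (1, 0, 0, 0)
score_at_root = score(t)            # currentScore at the root a
score_at_b = score((0,) + t[1:])    # A0 forced to 0 when descending the 0-edge

# Hypothetical minScore for node b (Figure 19's values are not available here).
skip_b = can_skip_subtree(score_at_b, min_score=3)
```

The check relies only on monotonicity: if t dominates t′ then score(t) > score(t′), so a subtree whose minimum score is above currentScore cannot contain a dominated tuple.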

B. EXTENDING THE DATA STRUCTURE FOR CATEGORICAL ATTRIBUTES
We now discuss how to modify the ST algorithms for relations having categorical attributes. We need to make the following two changes:
• The tree structure designed in §3.1 needs to be modified for categorical attributes.
• We also need to change the tree traversal algorithms used in each of the three primitive operations.
Tree structure: The tree structure will not be binary anymore.

C. TOP-DOWN
Here we provide the details of the TOP-DOWN algorithm proposed in § 4.1.Given a subspace skyline query Q, consider the corresponding subspace lattice.Each node u in the  lattice corresponds to a unique attribute combination which can be represented by a unique id.We assume the existence of the following two functions, (i) ID(C): returns the id of an attribute value combination, and (ii) InvID(id, m ): returns the corresponding attribute-value combination for id.
The details of these functions can be found in [22].
We observe that given a node identifier id, one can identify the ids of the parents (resp. children) of its corresponding node by calling the two functions InvID and ID. To do so, we first determine the attribute value combination corresponding to id, then identify the combinations of its parents (resp. children) by incrementing (resp. decrementing) the value of each attribute, and finally compute the id of each combination using the function ID. TOP-DOWN starts by traversing the lattice from its top node, at which all attributes have the maximum possible value. It then conducts a BFS over the lattice while constructing the level (i − 1) nodes from the non-empty nodes at level i. A node in the lattice is dominated if either one of its parents is dominated or there exists a tuple in the relation that matches the combination of one of its parents.
Let id denote the id of the node in the lattice currently scanned by TOP-DOWN. The algorithm first identifies the parents of the current node and checks if all of them (i) have been constructed (i.e., have not been dominated) and (ii) are marked as not present (i.e., there is no tuple in DQ that has the combination of one of its parents). If so, the algorithm then checks if there exist tuples in DQ with the same attribute value combination. We use the term querying a node to refer to this operation. Algorithm 7 presents the pseudocode of this operation for a specific attribute value combination. If no such tuple exists in DQ, the algorithm marks id as not present and moves to the next element. Otherwise, it labels id as present and outputs the tuples returned from GET-TUPLES as skyline tuples. The TOP-DOWN algorithm queries a node only when the attribute value combination corresponding to the node is incomparable with the skylines discovered earlier. The algorithm stops when there are no other ids in its processing queue.
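The traversal can be sketched for Boolean attributes as below. This is a simplified illustration rather than the paper's implementation: nodes are represented directly as value tuples instead of ids, a Counter lookup stands in for GET-TUPLES, the function returns the distinct skyline value combinations rather than tuple ids, and the relation at the bottom is hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Values = Tuple[int, ...]


def top_down(tuples: Sequence[Values]) -> Tuple[List[Values], int]:
    """Level-wise traversal of the Boolean lattice from the all-ones node.

    Returns the skyline value combinations and the number of nodes queried.
    """
    m = len(tuples[0])
    present = Counter(tuples)  # stands in for querying a node via GET-TUPLES
    skyline: List[Values] = []
    queried = 0
    level = {tuple([1] * m)}
    while level:
        empty = set()  # nodes at this level that were queried and marked not present
        for node in sorted(level, reverse=True):
            queried += 1
            if present[node]:
                skyline.append(node)  # present: its descendants are dominated
            else:
                empty.add(node)
        nxt = set()
        for node in empty:
            for i, v in enumerate(node):
                if v == 1:
                    child = node[:i] + (0,) + node[i + 1:]
                    parents = [
                        child[:j] + (1,) + child[j + 1:]
                        for j, w in enumerate(child)
                        if w == 0
                    ]
                    # Query a child only if every parent was queried and empty.
                    if all(p in empty for p in parents):
                        nxt.add(child)
        level = nxt
    return skyline, queried


# Hypothetical relation over four Boolean attributes.
relation = [(0, 1, 0, 1), (0, 0, 1, 0), (1, 0, 0, 0), (0, 0, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0)]
sky, n_queries = top_down(relation)
```

On this hypothetical relation six nodes are queried: the top node, the four nodes of the next level, and ⟨0, 1, 0, 1⟩, whose two parents are both empty.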
The lattice structure for the subspace skyline query Q in Example 2 is shown in Figure 20. Each node u in the lattice represents a specific attribute value assignment in the data space corresponding to Q. For example, the top-most node in the lattice represents a tuple t with all attribute values equal to 1 (i.e., t[Ai] = 1, ∀Ai ∈ Q). We start from the top node of the lattice. No tuple in DQ has value 1 on all the attributes in Q. Therefore, TOP-DOWN marks this node not present (np). We then move to the next level and start scanning nodes from the left. There exists a tuple t6 ∈ DQ with attribute values ⟨1, 1, 1, 0⟩. Hence, we mark this node present (p) and output t6 as a skyline tuple. The algorithm stops after querying node ⟨0, 1, 0, 1⟩. TOP-DOWN only needs to query 6 nodes (i.e., check 6 attribute value combinations) in order to discover the skylines. Note that the number of nodes queried by TOP-DOWN grows with the number of attributes in Q and shrinks as the relation size n grows. This is because with large n and small |Q|, the likelihood of having tuples in the relation that correspond to the upper-level nodes of the lattice is high.
Algorithm GET-TUPLES: The algorithm to retrieve the tuples in the relation matching the attribute value combination of a specific node is described in Algorithm 7.
The algorithm accepts two inputs: (1) the array values, representing the value of each attribute Ai ∈ Q, and (2) the sorted lists LQ. For each attribute Ai ∈ Q (1 ≤ i ≤ m′), the algorithm retrieves the set of tupleIDs Si that have value equal to values[i]. This is done by performing a search operation on the sorted list Li. The tupleIDs that appear in every Si are the ids of the tuples that satisfy the current attribute value combination. We identify these ids by performing a set intersection operation among all the sets Si (1 ≤ i ≤ m′). Once the ids of all the tuples that match the array values are identified, the algorithm creates a tuple for each id with the same attribute values and returns the tuple list. TOP-DOWN then outputs all the tuples in tupleList as skyline tuples.
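A sketch of this lookup is given below, using binary search from the standard bisect module and set intersection. The list layout (negated values stored in ascending order next to a parallel id array) is an implementation choice made for this sketch, and the relation is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, List, Sequence, Tuple

Values = Tuple[int, ...]
SortedList = Tuple[List[int], List[str]]  # (negated values ascending, tuple ids)


def build_lists(relation: Dict[str, Values]) -> List[SortedList]:
    """One sorted list per attribute, best value first."""
    m = len(next(iter(relation.values())))
    lists = []
    for j in range(m):
        pairs = sorted((-t[j], tid) for tid, t in relation.items())
        lists.append(([k for k, _ in pairs], [tid for _, tid in pairs]))
    return lists


def get_tuples(values: Sequence[int], lists: List[SortedList]) -> List[Tuple[str, Values]]:
    """Return (tupleID, values) for every tuple matching the value combination."""
    matching = None
    for (keys, ids), v in zip(lists, values):
        # Binary search for the run of entries whose value equals v.
        lo, hi = bisect_left(keys, -v), bisect_right(keys, -v)
        s_i = set(ids[lo:hi])
        matching = s_i if matching is None else matching & s_i
        if not matching:
            break  # the intersection is already empty
    return [(tid, tuple(values)) for tid in sorted(matching or ())]


# Hypothetical relation over four Boolean attributes.
relation = {
    "t1": (0, 1, 0, 1), "t2": (0, 0, 1, 0), "t3": (1, 0, 0, 0),
    "t4": (0, 0, 0, 1), "t5": (1, 0, 1, 1), "t6": (1, 1, 1, 0),
}
lists = build_lists(relation)
hit = get_tuples((1, 1, 1, 0), lists)
miss = get_tuples((1, 1, 1, 1), lists)
```

The binary searches are logarithmic, while materializing and intersecting the id sets is linear in their sizes, which matches the cost breakdown discussed in the performance analysis.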

C.0.1 Performance Analysis
For each non-dominated node in the lattice, the TOP-DOWN algorithm invokes the function GET-TUPLES. Hence, we measure the cost of TOP-DOWN as the number of nodes in the lattice for which we invoke GET-TUPLES, times the cost of executing the GET-TUPLES function. Since the size of every sorted list is equal to n, applying binary search on a sorted list to obtain the tuples with a specified value on attribute Ai requires O(log(n)); thus the retrieval cost from all the m′ lists is O(m′ log(n)). Still, taking the intersection between the lists is in O(nm′), which makes the worst case cost of the GET-TUPLES operation O(nm′). Let k be the cost of the GET-TUPLES operation over LQ for the given relation D. Moreover, considering pi as the probability that a tuple has value 1 on the binary attribute Ai, we use C(l) to refer to the expected cost of the TOP-DOWN algorithm starting from a node u at level l of the lattice.
Theorem 4. Consider a Boolean relation D with n tuples, with pi being the probability of having value 1 on attribute Ai, and a subspace skyline query Q with m′ attributes. The expected cost of TOP-DOWN on D and Q starting from a node at level l is described by a recursive formula in C(l), derived in the proof below.
Proof. Consider a node u at level l of the lattice. Node u represents a specific attribute value assignment with l 0s and (m′ − l) 1s. Querying at node u will return all tuples in the dataset that have the same attribute value assignment as u. Let p(t, l) be the probability of a tuple t ∈ DQ having l 0s and (m′ − l) 1s.
If querying at node u returns at least one tuple, then we do not need to traverse the nodes dominated by u anymore. However, if there exists no tuple in DQ that corresponds to the attribute value combination of u, we at least have to query the nodes that are immediately dominated by u. Let p!∅(l) be the probability that there exists a tuple t ∈ DQ that has the same attribute value assignment as u; it is determined by p(t, l) and n. There are in total (m′ − l) nodes immediately dominated by u. Therefore, the cost at node u is the cost of the query operation (i.e., k) plus, with probability (1 − p!∅(l)), the cost of querying its (m′ − l) immediately dominated nodes.
Note that a node u at level l has a total of l immediate dominators, causing the cost at node u to be computed l times. However, TOP-DOWN needs to perform only one query at node u. Hence, the actual cost can be obtained by dividing the computed cost by l. The expected cost increases exponentially as we increase the query length. Moreover, the expected cost also increases when the attributes in Q have higher cardinality.
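The quantities used in the proof can be written out as follows. This is a hedged reading of the prose, not a transcription of the paper's equations: it assumes that the n tuples are independent, that p(t, l) denotes the probability that a single tuple matches the assignment of u, and that the recursion bottoms out at level m′ with cost k (the boundary condition is an assumption). The correction for multiple counting described in the last paragraph of the proof is not folded in.

```latex
% Probability that at least one of the n tuples matches node u
p_{!\emptyset}(l) = 1 - \bigl(1 - p(t, l)\bigr)^{n}

% Cost at a level-l node before correcting for multiple counting
C(l) = k + \bigl(1 - p_{!\emptyset}(l)\bigr)\,(m' - l)\,C(l + 1),
\qquad C(m') = k
```

Under this reading, a large n drives p_{!∅}(l) toward 1 at the upper levels, so the recursion is cut off early, which agrees with the observation that TOP-DOWN queries few nodes when n is large and |Q| is small.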

D. PROOFS
In this section, we provide detailed proofs for the theorems from the main section of the paper.
Theorem 1. Considering a relation with n binary attributes where pi is the probability that a tuple has value 1 on attribute Ai, the expected cost of IS-DOMINATED(tQ) operation on a tree T , containing s tuples is as specified in Equation 2.
Proof. Let t be the tuple for which we have to check whether it is dominated. IS-DOMINATED stops the recursion when we reach a leaf node or move to a node that is empty (i.e., has no tuple mapped under it). Therefore, C(m′, s) = 1 and C(l, 0) = 1.
Let us assume that we are in node u at level l of the tree and there are s tuples mapped in the subtree rooted at u.
If t[A_l] = 0, IS-DOMINATED first searches in the right subtree. If no tuple t′Q in the right subtree dominates tQ, we then move to the left subtree. Let us assume the right subtree of u contains s_right tuples (s_right ≤ s). Let S(l, s_right) be the probability that there exists a tuple in the right subtree of u, containing s_right tuples, that dominates tQ. In order for a tuple t′Q to dominate tQ, it must have value 1 on the attributes in A_ones(t[l+1:m′]). This is because, since t′[Ai] ≥ t[Ai] (1 ≤ i ≤ l − 1) and t′[A_l] > t[A_l], having value 1 on the attributes in A_ones(t[l+1:m′]) is enough for t′Q to dominate tQ. Hence, the probability of t′Q dominating tQ is the probability that t′ has value 1 on every attribute in A_ones(t[l+1:m′]). For a node at level l containing s tuples under it, with the probability of having 1 on attribute A_l being p_l, the left subtree will have i tuples with the binomial probability $\binom{s}{i}(1 - p_l)^{i}\,p_l^{s-i}$. Hence, the expected cost at node u, C(l, s), is as specified in Equation 2.
Theorem 2. Given a Boolean relation D with n tuples, with pi being the probability of having value 1 on attribute Ai, the expected cost of the PRUNE-DOMINATED-TUPLES(tQ) operation on a tree T containing s tuples is as computed in Equation 3.
Proof. PRUNE-DOMINATED-TUPLES(tQ) stops the recursion when we reach a leaf node or move to a node that is empty (i.e., has no tuple mapped under it). Therefore, C(m′, s) = 1 and C(l, 0) = 1.
Suppose we are in node u at level l of the tree and there are s tuples mapped in the subtree rooted at u.
If t[A_l] = 0, we need to search only in the left subtree, whereas for t[A_l] = 1 we need to search both the left and the right subtrees.
Let p_l be the probability of having value 1 on attribute A_l. The left subtree of node u at level l (with s tuples under it) will have i tuples with the binomial probability $\binom{s}{i} (1 - p_l)^{i} \, p_l^{s-i}$.

Lemma 1. Considering pi as the probability that a tuple has value 1 on the binary attribute Ai, the expected number of tuples discovered by TA-SKY after iterating i lines is as computed in Equation 4.
Proof. The probability that a tuple t is discovered by iterating i rows is one minus the probability that t is not discovered in any of the m lists in LQ. Formally:

$P_{seen}(t, i) = 1 - \prod_{L_j \in L_Q} P_{!seen}(t, i, L_j)$

where P!seen(t, i, Lj) is the probability that t is not discovered at list Lj until row i. P!seen(t, i, Lj) depends on the number of (tupleId, value) pairs with value 1 in list Lj. A list Lj has k (tupleId, value) pairs with value 1 if the database has k tuples with value 1 on attribute Aj, while the others have value 0 on it. Thus, the probability that Lj has k (tupleId, value) pairs with value 1 is:

$\binom{n}{k} \, p_j^{k} (1 - p_j)^{n-k}$

t is not seen until row i at list Lj if either of the following cases happens:
• t[Aj] = 0 and (considering the random positioning of tuples in lists) t is located after position i in list Lj, for all the cases in which Lj has k (k < i) (tupleId, value) pairs with value 1.
• t[Aj] = 1 and (considering the random positioning of tuples in lists) t is located after position i in list Lj, for all the cases in which Lj has k (k > i) (tupleId, value) pairs with value 1.

Combining the two cases gives P!seen(t, i, Lj), from which Pseen(t, i) follows. Given the probability of a tuple being discovered by iterating i lines, the expected number of tuples discovered by iterating i lines is:

$E_{seen}[i] = n \cdot P_{seen}(t, i)$ (4)

Theorem 3. Given a subspace skyline query Q, the expected number of sorted accesses performed by TA-SKY on an n-tuple Boolean database, with pj being the probability of having value 1 on attribute Aj, is expressed in terms of Pstop(i), where Pstop(i) is computed using Equations 7, 8, and 9.
Proof. Let us first compute the probability that the algorithm stops after visiting i rows of the lists. Note that the algorithm checks the stopping condition at iteration i if cvij = 0 for at least one sorted list. Thus, the algorithm stops when (1) cvij = 0 for at least one sorted list and (2) there exists a tuple among the discovered ones that dominates the maximum possible tuple in the remaining lists.
Suppose i′ tuples have been seen in at least one of the lists so far. Using Lemma 1, we can set i′ = Eseen[i]. Let Pj0(i) be the probability that cvij = 0 for sorted list Lj.
Moreover, let P0(i, k) be the probability that, after iteration i, cvi = 0 for k sorted lists, and let Qk be the corresponding attribute set. For a given setting in which cvi = 0 for k sorted lists, the algorithm stops iff there exists at least one tuple among the discovered ones that dominates the maximum possible value in the m sorted lists, i.e., the value combination that has 0 in k positions and 1 in all the remaining m − k positions.
A tuple t needs to have value 1 in all the m − k lists and also at least one value 1 in one of the k lists (Qk) to dominate the maximum possible remaining value. Assuming independent attributes, the probability that a given tuple satisfies this condition is:

$\prod_{A_j \in Q \setminus Q_k} p_j \cdot \left(1 - \prod_{A_j \in Q_k} (1 - p_j)\right)$

Thus, the probability of having at least one tuple that satisfies the dominating condition is one minus the probability that none of the discovered tuples satisfies it. We can now compute the probability distribution of the algorithm cost and, from it, the expected number of sorted accesses performed by TA-SKY.
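The dominating condition used in this proof can be checked numerically. The sketch below is illustrative only: it assumes independent attributes and independent tuples, and the function names `p_tuple_dominates` and `p_some_tuple_dominates` as well as the 0-based list indices are hypothetical, not taken from the paper.

```python
from math import prod
from typing import Sequence


def p_tuple_dominates(p: Sequence[float], zero_lists: Sequence[int]) -> float:
    """Probability that one random tuple dominates the maximum possible
    remaining value combination (assumption: independent attributes).

    p[j]       -- probability of value 1 on the j-th query attribute
    zero_lists -- indices of the k lists whose current value is 0
    """
    zero = set(zero_lists)
    # The tuple must have value 1 on every attribute outside the k lists.
    all_ones_elsewhere = prod(p[j] for j in range(len(p)) if j not in zero)
    # It must also have at least one value 1 among the k lists.
    no_one_in_zero_lists = prod(1.0 - p[j] for j in zero)
    return all_ones_elsewhere * (1.0 - no_one_in_zero_lists)


def p_some_tuple_dominates(p: Sequence[float], zero_lists: Sequence[int],
                           seen: int) -> float:
    """Probability that at least one of `seen` discovered tuples satisfies
    the dominating condition (assumption: tuples are independent)."""
    return 1.0 - (1.0 - p_tuple_dominates(p, zero_lists)) ** seen


if __name__ == "__main__":
    # Three attributes with p = 0.5 each; only the first list has reached 0.
    print(p_tuple_dominates([0.5, 0.5, 0.5], [0]))
    print(p_some_tuple_dominates([0.5, 0.5, 0.5], [0], seen=2))
```

With more discovered tuples the second probability approaches 1, which matches the intuition that the stopping condition becomes easier to satisfy as the iteration proceeds.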

Figure 1: Tree structure for relation in Example 1

are dominated by any tuple in L[j] are eliminated. Finally, skylines in L[i] are then discovered in a recursive manner (lines 10-15).

Algorithm 5 ST-P
Input: Tuple list T and query Q;

Figure 5: Expected cost of IS-DOMINATED and PRUNE-DOMINATED-TUPLES operations as a function of s

probability of following left edge (edges corresponding to value 0) of a tree node decreases. The above simulations show that the tree structure can effectively reduce the cost of the dominance test, thus improving the overall performance of the ST algorithms. Although the analysis has been carried out for i.i.d. data, our experimental results in §6 show similar behavior for other types of datasets.
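The two tree operations whose costs are analyzed here can be sketched for Boolean attributes as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the tree is modeled as nested dictionaries keyed by attribute value (0 for the left edge, 1 for the right edge), larger values are preferred, and the names `build_tree`, `is_dominated`, and `prune_dominated` are hypothetical.

```python
from typing import Dict, Iterable, Sequence

Tree = Dict[int, "Tree"]


def build_tree(tuples: Iterable[Sequence[int]]) -> Tree:
    """Map each Boolean tuple to a root-to-leaf path (0 = left, 1 = right)."""
    root: Tree = {}
    for t in tuples:
        node = root
        for value in t:
            node = node.setdefault(value, {})
    return root


def is_dominated(node: Tree, t: Sequence[int], level: int = 0,
                 strict: bool = False) -> bool:
    """Return True if some stored tuple dominates t."""
    if level == len(t):
        # Leaf: the stored tuple is >= t everywhere; it dominates t only
        # if it was strictly better on at least one attribute.
        return strict
    # Search the right subtree (value 1) first.
    if 1 in node and is_dominated(node[1], t, level + 1,
                                  strict or t[level] == 0):
        return True
    # The left subtree can hold a dominator only when t has value 0 here.
    if t[level] == 0 and 0 in node:
        return is_dominated(node[0], t, level + 1, strict)
    return False


def prune_dominated(node: Tree, t: Sequence[int], level: int = 0,
                    strict: bool = False) -> bool:
    """Delete stored tuples dominated by t; return True if node is now empty."""
    if level == len(t):
        return strict  # remove the leaf only if t is strictly better somewhere
    # With t[level] == 0 only the left subtree can hold dominated tuples;
    # with t[level] == 1 both subtrees must be searched.
    for value in ([0] if t[level] == 0 else [0, 1]):
        if value in node and prune_dominated(node[value], t, level + 1,
                                             strict or value < t[level]):
            del node[value]
    return not node


if __name__ == "__main__":
    tree = build_tree([(1, 0, 1), (1, 1, 0), (0, 0, 1)])
    print(is_dominated(tree, (0, 0, 1)))   # dominated by (1, 0, 1)
    prune_dominated(tree, (1, 0, 1))       # removes (0, 0, 1) only
    print(tree)
```

Tracking the `strict` flag along the path avoids comparing full tuples at the leaves, since the edges followed already guarantee the non-strict part of the dominance condition.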

Figure 20: Nodes traversed by TOP-DOWN Algorithm

Figure 21: Expected number of nodes queried vs. query length

Limitation: We use Equation 10 to compute |C(l)| as a function of |Q| over three uniform relations containing one million tuples with cardinality 2, 4, and 6, respectively. The expected cost increases exponentially as we increase the query length. Moreover, the expected cost also increases when the attributes in Q have higher cardinality.

Table 1: A sample hosts dataset

Definition 1. (Dominance). A tuple t ∈ D dominates a tuple t′ ∈ D, denoted by t ≻ t′, iff ∀A ∈ A, t[A] ≥ t′[A] and ∃A ∈ A, t[A] > t′[A]. Moreover, a tuple t ∈ D is not comparable with a tuple t′ ∈ D, denoted by t ∼ t′, iff t ⊁ t′ and t′ ⊁ t.

Definition 2. (Skyline). Skyline, S, is the set of tuples that are not dominated by any other tuples in D, i.e.: S = {t ∈ D | ∄t′ ∈ D s.t. t′ ≻ t}
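The two definitions translate directly into a quadratic-time baseline. The following sketch is illustrative only and is not one of the algorithms proposed in the paper; it assumes tuples are encoded as equal-length integer sequences with larger values preferred, and the names `dominates` and `naive_skyline` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dominates(t: Sequence[int], u: Sequence[int]) -> bool:
    """True if t dominates u: t >= u on every attribute and
    t > u on at least one attribute."""
    return (all(a >= b for a, b in zip(t, u))
            and any(a > b for a, b in zip(t, u)))


def naive_skyline(tuples: List[Tuple[int, ...]]) -> List[Tuple[int, ...]]:
    """Keep every tuple that no other tuple dominates (O(n^2) comparisons)."""
    return [t for t in tuples if not any(dominates(u, t) for u in tuples)]


if __name__ == "__main__":
    data = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0)]
    # (1, 0, 1) and (1, 1, 0) are incomparable; each dominates one other tuple.
    print(naive_skyline(data))
```

Because a tuple never dominates itself under this definition, the inner loop does not need to skip the tuple being tested.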

Table 2: Table of notations

Table 3: Example 1 relation

Else, the leaf node is deleted from the tree. Upon return from the recursion, we check if both the left and right child of the current (internal) node are empty.

Input: Tuple t, Node n, Level l, Score s, Query Q;
if n is None or n.minScore > s return
if l == |Q| + 1 and score(tQ) = n.score:
Both left and right children of n are None

Table 4: Example

In order to incorporate categorical attributes, each node u at level l (1 ≤ l ≤ m) of the tree now should have |Dom(A_l)| children, one for each attribute value v ∈ Dom(A_l). We shall index the edges from left to right, where the leftmost edge corresponds to the lowest attribute value and the attribute value corresponding to each edge increases as we move from the leftmost edge to the rightmost edge.

INSERT(t): After reaching a node u at level l, select the t[A_l]-th child of u for moving to the next level of the tree.
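The categorical INSERT step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: attribute values are assumed to be encoded as integers 0 .. |Dom(A_l)| − 1, levels are 0-based, and the names `Node`, `CategoricalTree`, and `tuple_ids` are hypothetical.

```python
from typing import List, Optional, Sequence


class Node:
    def __init__(self, cardinality: int = 0) -> None:
        # One child slot per attribute value, ordered from lowest to highest.
        self.children: List[Optional["Node"]] = [None] * cardinality
        self.tuple_ids: List[int] = []  # filled only at leaf nodes


class CategoricalTree:
    def __init__(self, cardinalities: Sequence[int]) -> None:
        # cardinalities[l] is |Dom(A_l)| for the attribute tested at depth l.
        self.cardinalities = list(cardinalities)
        self.root = self._new_node(0)

    def _new_node(self, level: int) -> Node:
        if level < len(self.cardinalities):
            return Node(self.cardinalities[level])
        return Node(0)  # leaf: no further attribute to branch on

    def insert(self, tuple_id: int, t: Sequence[int]) -> None:
        """Follow the t[A_l]-th child at each level, creating nodes on demand."""
        node = self.root
        for level, value in enumerate(t):
            if node.children[value] is None:
                node.children[value] = self._new_node(level + 1)
            node = node.children[value]
        node.tuple_ids.append(tuple_id)


if __name__ == "__main__":
    tree = CategoricalTree([2, 3])   # A_1 is Boolean, A_2 has three values
    tree.insert(7, (1, 2))
    tree.insert(8, (1, 2))           # same value assignment shares one leaf
    print(tree.root.children[1].children[2].tuple_ids)
```

Creating children lazily keeps empty branches out of the tree, so a search that reaches a missing child can stop immediately.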

Input: Array values, Sorted lists LQ;
Output: List of tuples that have the same attribute value assignment as values.
tupleIDSet = ∅
for i = 1 to len(values) do