Synthesis of Allowlists for Runtime Protection against SQLi

Data is the new oil. This metaphor is commonly used to highlight the fact that data is a highly valuable commodity. Nowadays, much of worldwide data sits in SQL databases and transits through various web-based applications. As the value of data increases and attracts more attention from malicious actors, application protections against SQL injections need to become more sophisticated. Although SQL injections have been known for many years, they are still one of the top security vulnerabilities. For example, in 2022 more than 1000 CVEs related to SQL injection were reported. We propose a runtime application protection approach that infers and constrains the information that can be disclosed by database-backed applications. Where existing approaches use syntax or hand-crafted features as a proxy for information disclosure, we propose a lightweight, but precise, information disclosure model that faithfully captures the semantics of SQL and achieves finer-grain security.


INTRODUCTION
A common way of securing web applications against cyber threats is by using a Web Application Firewall (WAF) that monitors HTTP(S) traffic and blocks suspicious requests.To identify SQL injections (SQLi), WAFs analyse request payloads for syntactic anomalies.For instance, payloads containing apostrophes, comments, or SQL keywords could be blocked as potentially dangerous.
While effective at deterring simple SQLi attacks, WAFs can be evaded using malicious payloads whose syntax resemble benign inputs.For instance, an attacker can send a harmless-looking hexadecimal string that will be executed as code or use functions like concat and chr to obfuscate malicious payloads.WAF's analysis is also limited to network payloads, which may not contain SQL fragments but still influence how SQL queries are built at runtime.Such payloads are typically application-specific and difficult to detect at the network level.
A more robust prevention of SQLi attacks is offered by database firewalls or runtime application self protection (RASP) tools that intercept and check incoming SQL queries before they reach the database.Checking queries for SQLi is delegated to a security policy that is typically generated from known benign queries.
A popular way of detecting SQLi attacks is by comparing the parse trees of known benign and incoming queries.For example, Sofia [6] is a SQLi detector that uses parse tree similarity to cluster known benign queries and report incoming queries that are too distant from any benign cluster as potential attacks.SEPTIC [12], on the other hand, generates context-sensitive policies (or profiles) as parse-tree signatures.At runtime, a SQLi attack is reported if the signature of an incoming query does not match any pre-recorded benign signatures.Other techniques utilising parse trees for prevention of SQLi attacks include [2] and [4].
Because they report all syntactic deviations as SQLi, parse treebased methods can be prone to false positives and negatives.Dynamic yet benign queries that add or remove predicates e.g., a search and filter functionality, can be rejected while syntactically similar yet malicious queries, e.g., mimicry attacks [19] can be accepted.Syntactic properties alone are thus insufficient to accurately model benign queries.In an attempt to produce syntax-agnostic models, SQLBlock [9] generates context-sensitive "profiles" that capture features such as table names, logical operators, functions and types of queries (e.g., SELECT) emitted from a given function in the application.At runtime, incoming queries are allowed if all their features are listed in the profile matching their calling function.In practice, SQLBlock has been shown to be effective against common SQLi attacks that access system tables, execute new functions or use additional operators.However, depending on the training queries, SQLBlock can also be prone to false positives and negatives, as highlighted in the motivating examples below.Furthermore, because SQLBlock operates at the granularity of database tables, it provides limited protection against data ex-filtration attacks targeting specific columns or rows.
Column and row granularity is typically required by applications that need to achieve a trade-off between performance (e.g., minimising the number of tables and table joins), maintainability (e.g., minimising the number user-role-or application-specific views) and privacy and might not be obvious from the open-source applications that have been studied in past work.This need has however been recognised by several database vendors who offer various flavours of row-and column-based access control [1,[14][15][16][17].
Motivating Example.Consider a "members" table, as shown in Table 1, that contains sensitive information, highlighted in red.Next, assume that the following queries pertaining to the "members" In summary, existing approaches are designed to prevent common SQLi attacks that alter the syntax of queries or attempt to access additional tables, functions and operators.As shown in our examples, however, using syntax or hand-picked features as a proxy for information disclosure is fundamentally limited and might result in false positives and negatives that will limit usability, privacy and the desired security posture.
In this paper we present a novel allowlist-based technique to prevent data ex-filtration attacks at the application level, before queries reach the database.Similar to SQLBlock, our technique automatically synthesises context-sensitive security policies from benign queries.Unlike existing bodies of work, however, our approach uses an information flow-based allowlist that is derived from the ASTs of the queries.At runtime, incoming queries are allowed if they not disclose more information than permitted by the allowlist.Queries that violate this condition are flagged as malicious and rejected.The following sections detail the process by which we synthesise allowlists, in the form of generalised abstract queries, from sets of benign queries.

OVERVIEW
The idea of synthesising SQL queries is not new.Past work has included synthesis from natural language descriptions [21], input examples [20], or using machine learning techniques from features in SQL queries [3].Our work is focused on generating an abstract semantic query that over-approximates a collection of queries and based on a simplified version of the information flow model from Guarnieri et al. [8].As the aim is not to prove non-interference, we do not consider explicit security labels associated with a tracebased semantics.This simplified model is generalised as part of the synthesis process.

Information Model
Our information model of a SQL query is based on disclosurecolumns disclosed to the user (e.g., via a select statement but can also include INSERT, UPDATE or DELETE), access -columns used to compute the disclosed information, and predicates -conditions for disclosure to occur expressed as a boolean formula.For a query , we let  () denote disclosed columns, () denote accessed, and  () denote predicates.For example, in the following query SELECT dept, email FROM members WHERE lname = 'Baker';  () is {dept, email}, () is {lname}, and  () is lname = 'Baker'.
Information Tuple.In a query, different columns can be disclosed under different conditions.For instance, the following example that uses a LEFT JOIN returns all rows from the left (members) table even if the condition in the ON clause evaluates to false.The structural graph for the above query is shown in Figure 1.Note the difference in direction of information flow between the join and the tables.This models LEFT JOIN used in this query, where ON predicates (associated only with table orders) are not reachable from the members table graph node.For each disclosed column in the query we compute an information tuple by identifying the table of that tuple and traversing the graph starting with that node.The predicates part of the tuple is constructed by conjunction of all reachable predicates.The set of accessed columns is then constructed from all encountered columns.For instance, to compute the information tuple for O.info we first locate the table node (O) and collect predicates from reachable ON and WHERE nodes that generates tuple ⟨O.info, {O.email, M.email, M.dept}, M.email = O.email ∧ M.dept = 'HR'⟩.The information tuple of M.lname is then ⟨M.lname, {M.dept}, M.dept = 'HR'⟩.

Figure 1: Example Structural Graph
In summary, structural decomposition represents a SQL query as a set of information tuples, where each tuple captures information about a distinct disclosed column.We let Inf() denote the set of information tuples of a query .

Information Disclosure
Given two information tuples  1 = ⟨ 1 ,  1 ,  1 ⟩ and  2 = ⟨ 2 ,  2 ,  2 ⟩ we say that  1 discloses no more information than  2 (written as  1 ⪯  2 ) if the following three conditions hold. (1) Query Disclosure.Given two queries,  1 and  2 , we say that  1 discloses no more information than  2 if for each information tuple  in Inf( 1 ) there exists a tuple  ′ in Inf( 2 ) such that  ⪯  ′ .
In the context of security checking we can say that an input query   is malicious with respect to an allowlist given by query   , if the property   does not disclose more information than   .

GENERALISATION
We construct an allowlist of permitted queries by using a training set of known benign queries.Then, an incoming query   is permitted by the allowlist A = { 1 , . . .,   } if A contains some query   that discloses at least as much information as   (i.e.,   ⪯   ).Since a training set of queries is rarely complete, however, such an allowlist is likely to lead to false positives.Furthermore, a large set of training queries can result in performance bottlenecks because rejecting a malicious query requires checking disclosure of all queries in the allowlist.
Recall that a query  is represented by a set of information tuples Inf() obtained from structural decomposition.The generalised query  + can be represented by a collection of generalised information tuples, i.e., the problem of query generalisation can be reduced to generalisation of information tuples.The generalised query is synthesised from the initial allowlist A = { 1 , . . .,   } as follows: (1) Use structural decomposition to generate the set of information tuples of A, i.e., {Inf( 1 ) ∪ . . .∪ Inf(  )}.
One way to generalise information tuples is by combining respective predicates and access columns.For example, generalisation over ⟨, {, },  1 ⟩ and ⟨, {, },  2 ⟩ becomes ⟨, {, , },  1 ∨  2 ⟩.Such an approach, however, may be considered too permissive.For example, for disclosure of the column , a query accessing columns  and  together will be permitted even though it has never been observed.Furthermore, access to column  will now be permitted under the predicate  1 that has not been observed either.
To resolve this issue, we extend information tuples to capture groups of columns observed together in the training set along with their respective predicates.We now show synthesis of an extended information tuple using a simple example.Consider information tuples  1 = ⟨, {, },  1 ⟩ and  2 = ⟨, {, },  2 ⟩.These tuples both disclose column  and therefore can be generalised.An extended information tuple combining  1 and  2 is shown below (the disclosed column  is omitted to simplify the presentation).

Accessed Predicate Correlation
A row of the table above describes an accessed column of the tuple, the predicate the column can be accessed under and its correlated set (i.e., set of columns it can be used with).The first row thus shows that column  can be accessed under predicate  1 and ICSE-NIER'24, April 14-20, 2024, Lisbon, Portugal Kostyantyn Vorobyov, François Gauthier, and Padmanabhan Krishnan together with column .This information is gathered from  1 tuple exclusively because  has not been observed in  2 .Column , on the other hand, appears in both tuples, therefore its correlated set is {, } and the predicate is a disjunction of predicates from tuples where it appears ( 1 ∨  2 ).
More formally, an extended information tuple is a quadruple ⟨, , pred, corr⟩, where  is a disclosed column,  is a set of accessed columns, pred is a mapping over  that associates accessed columns with predicates, and corr is a mapping over  that associates accessed columns with sets of their correlated columns, e.g., in the example above  is {, , }, pred() is  1 ∨  2 and corr() is {, }.
In summary, the generalised allowlist A  consists of a set of extended information tuples that disclose distinct columns.We use Inf  ({ 1 , . . .,   }) to denote generalisation over a training query set { 1 , . . .,   }.
Predicate Generalisation.Another dimension to generalisation is the generalisation over predicates generated from queries.Overall, standard techniques (e.g., abstract interpretation [10]) can be used for predicate generalisation.One simple yet practical approach is the generalisation over constants, e.g., replacing all integers by 1, all strings by 'S' and so on.The state-of-the-art tools such as Sofia or SQLBlock employ similar strategies and discard constants in subtrees or profiles.Another practical option is to use range abstraction [5,18].This type of generalisation is relevant for queries accepting limited numeric inputs.For more complex expressions domains such as Pentagons [11] or Octagons [13] can be used.
While specific predicate abstraction and generalisation is beyond the scope of this paper, we consider predicate generalisation to be a necessary step in generating a robust allowlist.
Cardinality Generalisation.Another dimension to generalisation is the generalisation over table columns.For instance, if a query accesses more columns of a table (say  ) than some configured threshold then the generalisation can allow access to all columns of that table.In a given extended information tuple this generalisation should update both the set of accepted columns and the correlation mapping to include all columns of  .

Generalised Information Disclosure
Given two extended information tuples  1 = ⟨ 1 ,  1 , pred 1 , corr 1 ⟩ and  2 = ⟨ 2 ,  2 , pred 2 , corr 2 ⟩ we say that  1 discloses no more information than  2 (denoted as Allowlist Checking.Let A  = Inf  ({ 1 , . . .,   }) be a generalised allowlist and   be an incoming query.We say that   is accepted by the generalised allowlist A  if for each extended information tuple   in Inf  ({  }) there exists an extended information tuple   in A  such that   ⪯   .Having no predicates our approach considers that this query accesses all columns including SSN.Since the allowlist does not permit access to SSN (even if it has not been disclosed) the query is rejected.

Example
SELECT email FROM members WHERE dept = 'HR' AND lname = 'Baker';

↩→
Our approach permits this query because it discloses no more information than specified by the allowlist, i.e., (1) information tuple disclosing email exists, (2) access to dept and lname has been observed, (3) correlation permits using dept and lname together and (4) the predicate checks under predicate generalisation hold.
SELECT * FROM members WHERE dept = 'HR'; This query is rejected because it discloses all column including SSN, whereas the allowlist does not permit disclosure of that column.

FUTURE WORK
In this paper we have described how finer-grain information protection in a world where "data is the new oil" [7] can be achieved.Our work is far from complete, however.Our allowlist synthesis strategy is designed to be used for runtime application protection, where overheads and false positives must be low.Achieving an acceptable trade-off between security, usability and performance will require careful experimentation on real-world applications and vulnerabilities.For example, we are currently investigating how context-sensitivity -synthesising and enforcing different allowlists for different execution contexts (e.g.call stacks) -could help us synthesise tighter allowlists, especially in applications that emit a wide spectrum of SQL queries.We are also considering various caching and optimisation strategies to lower the overhead of our runtime protections.

SELECT
fname || ' ' || lname AS name FROM Members M HAVING dept IN ('Sales', 'HR'); ↩→ Even though syntactically different from the training queries our approach allows it.The query discloses fname and lname columns under predicate dept = 'S' obtained by generalisation of the predicate build from the IN expression.This query therefore discloses no more information than the first query in the training set.
table were collected from the application under test: The addition of an AND operator would cause SQLBlock to reject this legitimate query.In the next example, SELECT * FROM members WHERE dept = 'HR'; the * operator causes the disclosure of sensitive social security numbers (SSN).Reasoning about the * requires knowledge of the database schema, which is not available in the AST.Hence SQL-Block would allow this query.The final example, SELECT fname || ' ' || lname AS name FROM Members M HAVING dept IN ('Sales', 'HR'); ↩→ discloses similar information than training queries (e.g.blue and yellow cells), it is syntactically different and introduces several new operators.This benign query would be rejected outright by syntax-based approaches and SQLBlock.
To account for this issue we represent a query as a set of information tuples where each tuple ⟨, , ⟩ models the disclosure of a single column , the set of accessed columns  needed for the disclosure and the predicates  leading to the disclosure.An information tuple is thus an information model of a query that discloses a single column.The above example query is represented by (or decomposed into) two tuples: ⟨M.lname, ∅, ⟩ and ⟨O.info, {M.email, O.email}, M.email = O.email⟩ Query Decomposition.Computing the decomposition of a query into a set of information tuples is done by traversing a graph that represents the semantics of the query in a structural form, called a structural graph.The nodes of the structural graph correspond to syntactic elements (e.g., SELECT, WHERE) and annotated with predicates.The edges of the structural graph capture direction of information flow.While supporting the full semantics of SQL is beyond the scope of this paper, we explain key concepts using a simple example shown below. of Allowlists for Runtime Protection against SQLi ICSE-NIER'24, April 14-20, 2024, Lisbon, Portugal SELECT M.lname, O.info FROM members AS M LEFT JOIN orders AS O ON M.email = O.email WHERE M.dept = 'HR' SELECT M.lname, O.info FROM members AS M LEFT JOIN orders AS O ON M.email = O.email; ↩→ That is, O.info is disclosed under predicate M.email = O.email and requires access to both email columns, whereas lname is disclosed unconditionally.
SELECT lname, fname, email FROM members WHERE dept = 'HR'; SELECT dept, email FROM members WHERE lname = 'Baker' OR dept = 'Sales'; SELECT email FROM members;