Complex SQL Queries Best Practices

Structured Query Language (SQL) is a standard language for defining, modifying, and controlling data within relational databases. A basic SQL query typically involves selecting data from tables, applying filters, and sorting the results.

However, advanced SQL queries become necessary when data structure and business logic complexity increase. They enable intricate data transformations at scale in areas like ETL (Extract, Transform, Load), data warehousing, data analysis, and reporting.

Key techniques in complex SQL queries include subqueries, joins, unions, intersections, aggregate functions, window functions, common table expressions (CTEs), pivoting, recursive queries, string manipulation, date and time functions, and case statements. This article provides an overview of each technique, along with best practices to achieve precise and efficient data processing.

Summary of complex SQL Queries techniques

Concept	Description
Common Table Expression (CTE)	CTEs preserve query results, allowing queries to reference them.
Recursive queries	Recursive queries process hierarchical or self-referencing data. They are typically built using CTEs.
Window functions	Calculates values over a defined row window but related to the current row, enabling complex analytics.
Union and intersection	Combines results from multiple queries (UNION for all, INTERSECT for common results).
Pivoting and unpivoting	Transforms data from rows to columns, and vice versa, for comparative analysis.
Case statements	Performs conditional logic within an SQL query.
String manipulation	Performs operations like concatenation, trimming, and substring extraction, including regular expressions.
Date and time functions	Handles date and time calculations, comparisons, and transformations.

Recursive queries and CTEs

Recursive queries allow you to link one record with others in the same table, making them suitable for handling hierarchical data structures such as organizational charts or product dependencies. They are typically built using Common Table Expressions (CTEs).

CTEs simplify complex queries by breaking them into smaller, reusable subparts. It involves defining the subquery with a meaningful name so that it can be referenced multiple times within the main query. CTEs improve readability and allow you to build queries modularly, making debugging and testing more efficient.

Example

Consider an EMPLOYEE table with a column called ManagerID, which refers to that employee's manager. This allows for many levels of employees and managers.

EmployeeID	EmployeeName	ManagerID
1	Sarah	NULL
2	John	1
3	Michelle	1
4	Frank	2
5	Jerry	4
6	Paula	4

We can use a recursive query with CTE to retrieve the hierarchy of employees with their levels.

WITH RECURSIVE OrgHierarchy AS (
-- Anchor Member: Start with the top-level manager (no ManagerID)
    SELECT
        EmployeeID,
        EmployeeName,
        ManagerID,
        1 AS Level
    FROM Employee
    WHERE ManagerID IS NULL

    UNION ALL

-- Recursive Member: Fetch employees reporting to the current level
    SELECT
        e.EmployeeID,
        e.EmployeeName,
        e.ManagerID,
        oh.Level + 1 AS Level
    FROM Employee e
    INNER JOIN OrgHierarchy oh
        ON e.ManagerID = oh.EmployeeID
)
-- Final Query: Retrieve the hierarchical structure
SELECT
    EmployeeID,
    EmployeeName,
    ManagerID,
    Level
FROM OrgHierarchy
ORDER BY Level, EmployeeID;

The RECURSIVE keyword is not required if you're using Oracle or SQL Server. The recursion is implicit when the CTE references itself. If you're using MySQL or Postgres, you'll need to add the word RECURSIVE after the WITH keyword, as shown above.

The query results in the following:

EmployeeID	EmployeeName	ManagerID	Level
1	Sarah	Null	1
2	John	1	2
3	Michelle	1	2
4	Frank	2	3
5	Jerry	4	4
6	Paula	4	4

Best practices

Clearly define when recursion should stop. The recursion stops automatically in the above example when there are no more rows to match, ensuring finite results. You can also avoid infinite recursions by explicitly limiting the recursion depth using a WHERE condition or SQL options like MAXRECURSION (in SQL Server) or LEVEL (in Oracle).

Keep the recursive member lightweight by selecting only required fields and avoiding complex calculations. Ensure proper indexing (e.g., on ManagerID) for efficient joins in recursive members.

Window functions

Window functions perform calculations across a defined row window relative to the current row without collapsing the rows into a single result as grouped aggregations do. They are particularly useful for running totals, moving averages, rank calculations, and comparisons between rows.

Ranking functions

ROW_NUMBER() assigns a unique sequential number to each row within a partition, starting from 1.
RANK() assigns a rank to each row within a partition, with gaps in ranking if there are ties.
DENSE_RANK() is similar to RANK() but without gaps in ranking for ties.

We can see how these functions rank data using the below table ORDERS containing order details.

OrderID	OrderDate	CustomerID	OrderAmount	Region
101	2024-11-01	1	250.00	North
102	2024-11-01	2	300.00	North
103	2024-11-02	3	300.00	North
104	2024-11-02	4	200.00	South
105	2024-11-03	5	150.00	South
106	2024-11-03	6	200.00	South

The below query calculates the row number, RANK(), and DENSE_RANK() of an order based on the order amount within a region:

SELECT
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    ROW_NUMBER() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS RowNumber,
    RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Rank,
    DENSE_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS DenseRank
FROM Orders
ORDER BY Region, OrderAmount DESC;

Result:

Region	OrderID	CustomerID	OrderAmount	RowNumber	Rank	DenseRank
North	102	2	300.00	1	1	1
North	103	3	300.00	2	1	1
North	101	1	250.00	3	3	2
South	104	4	200.00	1	1	1
South	106	6	200.00	2	1	1
South	105	5	150.00	3	3	2

Offset functions

LAG() fetches the value of a column from a preceding row within the same partition.
LEAD() fetches the value of a column from a subsequent row within the same partition.

SELECT
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    LAG(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS PreviousOrderAmount,
    LEAD(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS NextOrderAmount
FROM Orders
ORDER BY Region, OrderAmount DESC;

Result:

Region	OrderID	CustomerID	OrderAmount	PreviousOrderAmount	NextOrderAmount
North	102	2	300.00	NULL	300.00
North	103	3	300.00	300.00	250.00
North	101	1	250.00	300.00	NULL
South	104	4	200.00	NULL	200.00
South	106	6	200.00	200.00	150.00
South	105	5	150.00	200.00	NULL

Percentile (bucket) function

NTILE(n) divides rows into "n" equal groups and assigns a bucket number to each row. It is useful for creating histograms, analyzing rankings, or splitting data to enable parallel processing:

SELECT
 Region, OrderID, CustomerID, OrderAmount,
 NTILE(2) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Bucket
FROM Orders ORDER BY Region, OrderAmount DESC;

Result:

Region	OrderID	CustomerID	OrderAmount	Bucket
North	102	2	300.00	1
North	103	3	300.00	1
North	101	1	250.00	2
South	104	4	200.00	1
South	106	6	200.00	1
South	105	5	150.00	2

Aggregates

Standard aggregate functions like SUM(), AVG(), MIN(), and MAX() can be used as window functions, allowing calculations over a row window. The query below uses SUM() as a window function to calculate the cumulative or running sum:

SELECT
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    SUM(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC ROWS UNBOUNDED PRECEDING) AS RunningSum
FROM Orders
ORDER BY Region, OrderAmount DESC;

Union and intersection

UNION, UNION ALL, and INTERSECT are set operators useful when combining or comparing datasets in various analytical and operational contexts.

The UNION operator combines results from two or more queries into a single result set, removing duplicates by default. The UNION ALL operator performs better when duplicate removal is unnecessary.

UNION vs. UNION ALL

Feature	UNION	UNION ALL
Duplicates	Removes duplicates	Keep all rows, including duplicates
Performance	Slower (due to duplicate removal)	Faster
Use case	When unique results are needed	When all results are needed

Suppose customer data is stored in two different tables, OnlineCustomers and OfflineCustomers. Some customers may have purchased both online and offline, resulting in duplicates. Union provides a de-duplicated list of customers from both tables:

SELECT CustomerID, Name, Email FROM OnlineCustomers
UNION
SELECT CustomerID, Name, Email FROM OfflineCustomers;

Similarly, there are scenarios in which all records are needed, even if they are repeated. Consider sales data stored in two tables: OnlineSales and OfflineSales. You need all sales records, even if some transactions (e.g., the same product purchased by the same customer online and offline) are repeated:

SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OnlineSales
UNION ALL
SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OfflineSales;

On the other hand, the INTERSECT operator returns only the rows that are common to all involved queries:

SELECT CustomerID, Name, Email FROM OnlineCustomers
INTERSECT
SELECT CustomerID, Name, Email FROM OfflineCustomers;

The INTERSECT operator is not supported in all SQL dialects (e.g., MySQL versions before 8.0 do not support it). If unavailable, you can achieve similar functionality using INNER JOIN or subqueries.

Best practices

Ensure compatibility — the number and order of columns must match across queries with compatible data types.
Use UNION ALL for performance if duplicate removal is unnecessary.
Filter early by applying WHERE conditions in individual queries to reduce data before combining.
Ensure indexed columns are used in INTERSECT and filtering operations for better performance.

Pivoting and unpivoting

Pivoting and unpivoting are complex SQL queries that transform data structures for better analysis and reporting. Pivoting converts rows into columns, which is useful for creating summary or comparative views. Here's an example of pivoting to display the total order amount for each customer by year.

CUSTOMERS

CustomerID	CustomerName
1	Alice
2	Bob

ORDERS

OrderID	CustomerID	OrderDate	OrderAmount
101	1	10-05-2019	200
102	1	15-07-2020	300
103	2	12-09-2019	150
104	2	08-03-2021	400

SELECT *
FROM (
    SELECT c.CustomerName, YEAR(o.OrderDate) AS OrderYear, o.OrderAmount
    FROM Customers c
    INNER JOIN Orders o ON c.CustomerID = o.CustomerID
) AS SourceData
PIVOT (
    SUM(OrderAmount)
    FOR OrderYear IN ([2019], [2020], [2021])
) AS PivotTable;

The inner query prepares the source data with customer names, years, and order amounts. The PIVOT operator transforms values of OrderYear (2019, 2020, 2021) into columns. The SUM(OrderAmount) aggregates order amounts for each customer name and year.

PIVOTEDDATA

CustomerName	2019	2020	2021
Alice	200	300	NULL
Bob	150	NULL	400

Unpivoting converts columns into rows, which is useful for normalizing data or restructuring for analytical purposes.

The below query unpivots the pivoted data to transform the year columns (2019, 2020, 2021) back into rows:

SELECT CustomerName, SalesYear, SalesAmount
FROM (
    SELECT CustomerName, [2019], [2020], [2021]
    FROM PivotedData
) AS PivotedData
UNPIVOT (
    SalesAmount FOR SalesYear IN ([2019], [2020], [2021])
) AS UnpivotedData;

Result:

Product	SalesYear	SalesAmount
Alice	2019	200
Alice	2020	300
Bob	2019	150
Bob	2021	400

However, it's important to note that these operations are dialect-specific. While SQL Server and Oracle natively support the PIVOT and UNPIVOT operators, databases like MySQL and PostgreSQL do not. In such cases, alternative methods like conditional aggregation (for pivoting) or JOIN/UNION operations (for unpivoting) achieve similar results.

For example, the below query achieves pivoting by using a combination of CASE statements and aggregation functions like SUM:

SELECT
    c.CustomerName,
    SUM(CASE WHEN YEAR(o.OrderDate) = 2019 THEN o.OrderAmount ELSE 0 END) AS [2019],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2020 THEN o.OrderAmount ELSE 0 END) AS [2020],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2021 THEN o.OrderAmount ELSE 0 END) AS [2021]
FROM
    Customers c
INNER JOIN
    Orders o ON c.CustomerID = o.CustomerID
GROUP BY
    c.CustomerName;

The below query achieves unpivoting using a UNION ALL query, manually stacking columns into rows:

SELECT CustomerName, '2019' AS SalesYear, [2019] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2020' AS SalesYear, [2020] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2021' AS SalesYear, [2021] AS SalesAmount
FROM PivotedData;

Case statements

Case statements allow you to perform conditional logic within a SQL query. They are useful for tasks such as data categorization or creating custom calculations based on specific conditions.

The below query uses a case statement to implement bonus calculation logic:

SELECT
    Name,
    Sales,
    CASE
        WHEN Sales >= 700 THEN Salary * 0.10
        WHEN Sales >= 500 THEN Salary * 0.05
        ELSE Sales * 0.02
    END AS Bonus
FROM Employees;

String manipulation

String manipulation functions enable text data modifications and analysis within queries, such as concatenation, trimming, and extracting substrings. These functions are especially useful in ETL processes, data cleaning, and reporting.

CONCAT combines multiple strings into one:

SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;

TRIM removes leading and/or trailing whitespace or specified characters.
SUBSTRING in MySQL (or SUBSTR in SQL Server) extracts a portion of a string.
INSTR in MySQL (or CHARINDEX in SQL Server) finds the position of a substring within a string.

SELECT
    email,
    SUBSTRING(email, INSTR(email, '@') + 1) AS domain
FROM users;

Result:

email	domain
john.doe@gmail.com	gmail.com
alice.smith@xyz.com	xyz.com

REPLACE substitutes occurrences of a substring with another substring:

SELECT
    phone_number,
    REPLACE(phone_number, '-', '') AS standardized_phone
FROM contacts;

Result:

phone_number	standardized_phone
123-456-7890	1234567890
987-654-3210	9876543210

Regular expressions

Regular expressions are used for pattern matching, text manipulation, and validations within SQL. Databases like MySQL and PostgreSQL support regex natively, while others, such as SQL Server, may require additional functions or external libraries.

Example — validating email addresses in MySQL:

SELECT
    FullName,
    CASE
        WHEN Email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' THEN 'Valid'
        ELSE 'Invalid'
    END AS EmailStatus
FROM Users;

Note: SQL syntax and function names for string manipulation (e.g., CHARINDEX in SQL Server vs. INSTR in MySQL) can vary across dialects. Always check the documentation for your specific SQL system to ensure compatibility.

Result:

FullName	Email	EmailStatus
John Doe	John.doe@gmail.com	Valid
Emily Susan	Emily.susan@xyz	Invalid

Date and time functions

Date and time functions facilitate temporal analysis and transformations, such as calculating intervals, converting date formats, and comparing time-related data.

DATEPART extracts a specific part of a date, such as a year, month, day, or hour:

-- SQL Server
SELECT
    DATEPART(YEAR, GETDATE()) AS Year,
    DATEPART(MONTH, GETDATE()) AS Month,
    DATEPART(DAY, GETDATE()) AS Day;

-- PostgreSQL/Oracle
SELECT
    EXTRACT(YEAR FROM CURRENT_DATE) AS Year,
    EXTRACT(MONTH FROM CURRENT_DATE) AS Month,
    EXTRACT(DAY FROM CURRENT_DATE) AS Day;

DATEADD adds a specified interval to a date. For example, adding 7 days to the current date:

-- SQL Server
SELECT DATEADD(DAY, 7, GETDATE()) AS NewDate;

-- PostgreSQL
SELECT CURRENT_DATE + INTERVAL '7 days' AS NewDate;

-- Oracle
SELECT CURRENT_DATE + 7 AS NewDate;

DATEDIFF calculates the difference between two dates:

-- SQL Server
SELECT DATEDIFF(DAY, '2023-11-01', GETDATE()) AS DaysDifference;

-- PostgreSQL
SELECT EXTRACT(DAY FROM (CURRENT_DATE - DATE '2023-11-01')) AS DaysDifference;

-- Oracle
SELECT TRUNC(CURRENT_DATE - DATE '2023-11-01') AS DaysDifference;

FORMAT is useful for formatting dates:

-- SQL Server
SELECT FORMAT(GETDATE(), 'yyyy-MM-dd HH:mm:ss') AS FormattedDate;

-- PostgreSQL/Oracle
SELECT TO_CHAR(CURRENT_DATE, 'YYYY-MM-DD HH24:MI:SS') AS FormattedDate;

EOMONTH/LAST_DAY is useful for finding the last day of the month:

-- SQL Server
SELECT EOMONTH(GETDATE()) AS LastDayOfMonth;

-- Oracle
SELECT LAST_DAY(CURRENT_DATE) AS LastDayOfMonth;

--- PostgreSQL
SELECT (DATE_TRUNC('month', now()) + interval '1 month - 1 day');

Recommendations for writing complex SQL queries

Keep it simple

Avoid using complex joins, recursive functions, and nested aggregations unless absolutely necessary. Overly complex queries can be difficult to debug and may affect performance.

Build query bottom-up

Start by testing the basic query requirements. Then, introduce complexity step-by-step. This approach ensures that each query stage functions correctly before adding more layers.

Use more readable syntax

Use proper formatting, indentation, and comments to ensure clarity. Where applicable, simplify type conversions and transformations with shorthand syntax.

Adopt a declarative approach

Use a declarative approach to build reusable code blocks and make your SQL logic more modular and understandable. Declarative configurations emphasize specifying what to do rather than how to do it. They tend to be easier to debug, reuse, and optimize.

Unfortunately, SQL statements are only declarative in isolation. Multi-statement SQL scripts dictate the exact order of persistence (e.g., using INSERT and UPDATE in a specific order) and tend to become rigid and monolithic.

DataForge is an open-source, next-generation data transformation tool that improves SQL's declarative approach. It provides more flexibility and declarative power, especially for complex data transformations and processing tasks.

For example, we could define a "products" source using the below code with DataForge:

source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
- product_id int
- product_name string
- average_cost decimal
- average_sell_price decimal
rules:
 - name: "Round Average Cost"
   expression: "ROUND([This].average_cost, 2)"
 - name: "Truncate Product Name"
   expression: "CASE
     WHEN LENGTH([This].product_name) > 30
     THEN CONCAT(LEFT([This].product_name, 30), '...')
     ELSE [This].product_name
   END"

In this declarative approach, DataForge allows users to define queries in terms of transformations and conditions rather than explicit step-by-step updates. This configuration creates reusable rules (e.g., rounding or truncating) and enforces modular, maintainable transformations.

Conclusion

SQL offers a powerful suite of tools for data manipulation and transformation. By mastering techniques like recursive queries, window functions, CTEs, pivoting, and date/time manipulation, engineers can perform intricate data analysis and report directly within SQL, streamlining processes and enhancing efficiency.