Complex SQL Queries Best Practices

Structured Query Language (SQL) is a standard language for defining, modifying, and controlling data within relational databases. A basic SQL query typically involves selecting data from tables, applying filters, and sorting the results.

However, advanced SQL queries become necessary when data structure and business logic complexity increase. They enable intricate data transformations at scale in areas like ETL (Extract, Transform, Load), data warehousing, data analysis, and reporting.

Key techniques in complex SQL queries include subqueries, joins, unions, intersections, aggregate functions, window functions, common table expressions (CTEs), pivoting, recursive queries, string manipulation, date and time functions, and case statements. This article provides an overview of each technique, along with best practices to achieve precise and efficient data processing.

Summary of complex SQL Queries techniques

Concept Description
Common Table Expression (CTE) CTEs preserve query results, allowing queries to reference them.
Recursive queries Recursive queries process hierarchical or self-referencing data. They are typically built using CTEs.
Window functions Calculates values over a defined row window but related to the current row, enabling complex analytics.
Union and intersection Combines results from multiple queries (UNION for all, INTERSECT for common results).
Pivoting and unpivoting Transforms data from rows to columns, and vice versa, for comparative analysis.
Case statements Performs conditional logic within an SQL query.
String manipulation Performs operations like concatenation, trimming, and substring extraction, including regular expressions.
Date and time functions Handles date and time calculations, comparisons, and transformations.

Recursive queries and CTEs

Recursive queries allow you to link one record with others in the same table, making them suitable for handling hierarchical data structures such as organizational charts or product dependencies. They are typically built using Common Table Expressions (CTEs).

CTEs simplify complex queries by breaking them into smaller, reusable subparts. It involves defining the subquery with a meaningful name so that it can be referenced multiple times within the main query. CTEs improve readability and allow you to build queries modularly, making debugging and testing more efficient.

Example

Consider an EMPLOYEE table with a column called ManagerID, which refers to that employee's manager. This allows for many levels of employees and managers.

EmployeeID EmployeeName ManagerID
1SarahNULL
2John1
3Michelle1
4Frank2
5Jerry4
6Paula4

We can use a recursive query with CTE to retrieve the hierarchy of employees with their levels.

WITH RECURSIVE OrgHierarchy AS (
-- Anchor Member: Start with the top-level manager (no ManagerID)
    SELECT
        EmployeeID,
        EmployeeName,
        ManagerID,
        1 AS Level
    FROM Employee
    WHERE ManagerID IS NULL

    UNION ALL

-- Recursive Member: Fetch employees reporting to the current level
    SELECT
        e.EmployeeID,
        e.EmployeeName,
        e.ManagerID,
        oh.Level + 1 AS Level
    FROM Employee e
    INNER JOIN OrgHierarchy oh
        ON e.ManagerID = oh.EmployeeID
)
-- Final Query: Retrieve the hierarchical structure
SELECT
    EmployeeID,
    EmployeeName,
    ManagerID,
    Level
FROM OrgHierarchy
ORDER BY Level, EmployeeID;

The RECURSIVE keyword is not required if you're using Oracle or SQL Server. The recursion is implicit when the CTE references itself. If you're using MySQL or Postgres, you'll need to add the word RECURSIVE after the WITH keyword, as shown above.

The query results in the following:

EmployeeID EmployeeName ManagerID Level
1SarahNull1
2John12
3Michelle12
4Frank23
5Jerry44
6Paula44

Best practices

Clearly define when recursion should stop. The recursion stops automatically in the above example when there are no more rows to match, ensuring finite results. You can also avoid infinite recursions by explicitly limiting the recursion depth using a WHERE condition or SQL options like MAXRECURSION (in SQL Server) or LEVEL (in Oracle).

Keep the recursive member lightweight by selecting only required fields and avoiding complex calculations. Ensure proper indexing (e.g., on ManagerID) for efficient joins in recursive members.

Window functions

Window functions perform calculations across a defined row window relative to the current row without collapsing the rows into a single result as grouped aggregations do. They are particularly useful for running totals, moving averages, rank calculations, and comparisons between rows.

Ranking functions

  • ROW_NUMBER() assigns a unique sequential number to each row within a partition, starting from 1.
  • RANK() assigns a rank to each row within a partition, with gaps in ranking if there are ties.
  • DENSE_RANK() is similar to RANK() but without gaps in ranking for ties.

We can see how these functions rank data using the below table ORDERS containing order details.

OrderID OrderDate CustomerID OrderAmount Region
1012024-11-011250.00North
1022024-11-012300.00North
1032024-11-023300.00North
1042024-11-024200.00South
1052024-11-035150.00South
1062024-11-036200.00South

The below query calculates the row number, RANK(), and DENSE_RANK() of an order based on the order amount within a region:

SELECT
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    ROW_NUMBER() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS RowNumber,
    RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Rank,
    DENSE_RANK() OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS DenseRank
FROM Orders
ORDER BY Region, OrderAmount DESC;

Result:

Region OrderID CustomerID OrderAmount RowNumber Rank DenseRank
North1022300.00111
North1033300.00211
North1011250.00332
South1044200.00111
South1066200.00211
South1055150.00332

Offset functions

  • LAG() fetches the value of a column from a preceding row within the same partition.
  • LEAD() fetches the value of a column from a subsequent row within the same partition.
SELECT
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    LAG(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS PreviousOrderAmount,
    LEAD(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS NextOrderAmount
FROM Orders
ORDER BY Region, OrderAmount DESC;

Result:

Region OrderID CustomerID OrderAmount PreviousOrderAmount NextOrderAmount
North1022300.00NULL300.00
North1033300.00300.00250.00
North1011250.00300.00NULL
South1044200.00NULL200.00
South1066200.00200.00150.00
South1055150.00200.00NULL

Percentile (bucket) function

NTILE(n) divides rows into "n" equal groups and assigns a bucket number to each row. It is useful for creating histograms, analyzing rankings, or splitting data to enable parallel processing:

SELECT
 Region, OrderID, CustomerID, OrderAmount,
 NTILE(2) OVER (PARTITION BY Region ORDER BY OrderAmount DESC) AS Bucket
FROM Orders ORDER BY Region, OrderAmount DESC;

Result:

Region OrderID CustomerID OrderAmount Bucket
North1022300.001
North1033300.001
North1011250.002
South1044200.001
South1066200.001
South1055150.002

Aggregates

Standard aggregate functions like SUM(), AVG(), MIN(), and MAX() can be used as window functions, allowing calculations over a row window. The query below uses SUM() as a window function to calculate the cumulative or running sum:

SELECT
    Region,
    OrderID,
    CustomerID,
    OrderAmount,
    SUM(OrderAmount) OVER (PARTITION BY Region ORDER BY OrderAmount DESC ROWS UNBOUNDED PRECEDING) AS RunningSum
FROM Orders
ORDER BY Region, OrderAmount DESC;

Union and intersection

UNION, UNION ALL, and INTERSECT are set operators useful when combining or comparing datasets in various analytical and operational contexts.

The UNION operator combines results from two or more queries into a single result set, removing duplicates by default. The UNION ALL operator performs better when duplicate removal is unnecessary.

UNION vs. UNION ALL

Feature UNION UNION ALL
Duplicates Removes duplicates Keep all rows, including duplicates
Performance Slower (due to duplicate removal) Faster
Use case When unique results are needed When all results are needed

Suppose customer data is stored in two different tables, OnlineCustomers and OfflineCustomers. Some customers may have purchased both online and offline, resulting in duplicates. Union provides a de-duplicated list of customers from both tables:

SELECT CustomerID, Name, Email FROM OnlineCustomers
UNION
SELECT CustomerID, Name, Email FROM OfflineCustomers;

Similarly, there are scenarios in which all records are needed, even if they are repeated. Consider sales data stored in two tables: OnlineSales and OfflineSales. You need all sales records, even if some transactions (e.g., the same product purchased by the same customer online and offline) are repeated:

SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OnlineSales
UNION ALL
SELECT CustomerID, ProductID, Quantity, SalesAmount FROM OfflineSales;

On the other hand, the INTERSECT operator returns only the rows that are common to all involved queries:

SELECT CustomerID, Name, Email FROM OnlineCustomers
INTERSECT
SELECT CustomerID, Name, Email FROM OfflineCustomers;

The INTERSECT operator is not supported in all SQL dialects (e.g., MySQL versions before 8.0 do not support it). If unavailable, you can achieve similar functionality using INNER JOIN or subqueries.

Best practices

  • Ensure compatibility — the number and order of columns must match across queries with compatible data types.
  • Use UNION ALL for performance if duplicate removal is unnecessary.
  • Filter early by applying WHERE conditions in individual queries to reduce data before combining.
  • Ensure indexed columns are used in INTERSECT and filtering operations for better performance.

Pivoting and unpivoting

Pivoting and unpivoting are complex SQL queries that transform data structures for better analysis and reporting. Pivoting converts rows into columns, which is useful for creating summary or comparative views. Here's an example of pivoting to display the total order amount for each customer by year.

CUSTOMERS

CustomerID CustomerName
1Alice
2Bob

ORDERS

OrderID CustomerID OrderDate OrderAmount
101110-05-2019200
102115-07-2020300
103212-09-2019150
104208-03-2021400
SELECT *
FROM (
    SELECT c.CustomerName, YEAR(o.OrderDate) AS OrderYear, o.OrderAmount
    FROM Customers c
    INNER JOIN Orders o ON c.CustomerID = o.CustomerID
) AS SourceData
PIVOT (
    SUM(OrderAmount)
    FOR OrderYear IN ([2019], [2020], [2021])
) AS PivotTable;

The inner query prepares the source data with customer names, years, and order amounts. The PIVOT operator transforms values of OrderYear (2019, 2020, 2021) into columns. The SUM(OrderAmount) aggregates order amounts for each customer name and year.

PIVOTEDDATA

CustomerName 2019 2020 2021
Alice200300NULL
Bob150NULL400

Unpivoting converts columns into rows, which is useful for normalizing data or restructuring for analytical purposes.

The below query unpivots the pivoted data to transform the year columns (2019, 2020, 2021) back into rows:

SELECT CustomerName, SalesYear, SalesAmount
FROM (
    SELECT CustomerName, [2019], [2020], [2021]
    FROM PivotedData
) AS PivotedData
UNPIVOT (
    SalesAmount FOR SalesYear IN ([2019], [2020], [2021])
) AS UnpivotedData;

Result:

Product SalesYear SalesAmount
Alice2019200
Alice2020300
Bob2019150
Bob2021400

However, it's important to note that these operations are dialect-specific. While SQL Server and Oracle natively support the PIVOT and UNPIVOT operators, databases like MySQL and PostgreSQL do not. In such cases, alternative methods like conditional aggregation (for pivoting) or JOIN/UNION operations (for unpivoting) achieve similar results.

For example, the below query achieves pivoting by using a combination of CASE statements and aggregation functions like SUM:

SELECT
    c.CustomerName,
    SUM(CASE WHEN YEAR(o.OrderDate) = 2019 THEN o.OrderAmount ELSE 0 END) AS [2019],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2020 THEN o.OrderAmount ELSE 0 END) AS [2020],
    SUM(CASE WHEN YEAR(o.OrderDate) = 2021 THEN o.OrderAmount ELSE 0 END) AS [2021]
FROM
    Customers c
INNER JOIN
    Orders o ON c.CustomerID = o.CustomerID
GROUP BY
    c.CustomerName;

The below query achieves unpivoting using a UNION ALL query, manually stacking columns into rows:

SELECT CustomerName, '2019' AS SalesYear, [2019] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2020' AS SalesYear, [2020] AS SalesAmount
FROM PivotedData
UNION ALL
SELECT CustomerName, '2021' AS SalesYear, [2021] AS SalesAmount
FROM PivotedData;

Case statements

Case statements allow you to perform conditional logic within a SQL query. They are useful for tasks such as data categorization or creating custom calculations based on specific conditions.

The below query uses a case statement to implement bonus calculation logic:

SELECT
    Name,
    Sales,
    CASE
        WHEN Sales >= 700 THEN Salary * 0.10
        WHEN Sales >= 500 THEN Salary * 0.05
        ELSE Sales * 0.02
    END AS Bonus
FROM Employees;

String manipulation

String manipulation functions enable text data modifications and analysis within queries, such as concatenation, trimming, and extracting substrings. These functions are especially useful in ETL processes, data cleaning, and reporting.

CONCAT combines multiple strings into one:

SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;
  • TRIM removes leading and/or trailing whitespace or specified characters.
  • SUBSTRING in MySQL (or SUBSTR in SQL Server) extracts a portion of a string.
  • INSTR in MySQL (or CHARINDEX in SQL Server) finds the position of a substring within a string.
SELECT
    email,
    SUBSTRING(email, INSTR(email, '@') + 1) AS domain
FROM users;

Result:

email domain
john.doe@gmail.comgmail.com
alice.smith@xyz.comxyz.com

REPLACE substitutes occurrences of a substring with another substring:

SELECT
    phone_number,
    REPLACE(phone_number, '-', '') AS standardized_phone
FROM contacts;

Result:

phone_number standardized_phone
123-456-78901234567890
987-654-32109876543210

Regular expressions

Regular expressions are used for pattern matching, text manipulation, and validations within SQL. Databases like MySQL and PostgreSQL support regex natively, while others, such as SQL Server, may require additional functions or external libraries.

Example — validating email addresses in MySQL:

SELECT
    FullName,
    CASE
        WHEN Email REGEXP '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$' THEN 'Valid'
        ELSE 'Invalid'
    END AS EmailStatus
FROM Users;

Note: SQL syntax and function names for string manipulation (e.g., CHARINDEX in SQL Server vs. INSTR in MySQL) can vary across dialects. Always check the documentation for your specific SQL system to ensure compatibility.

Result:

FullName Email EmailStatus
John DoeJohn.doe@gmail.comValid
Emily SusanEmily.susan@xyzInvalid

Date and time functions

Date and time functions facilitate temporal analysis and transformations, such as calculating intervals, converting date formats, and comparing time-related data.

DATEPART extracts a specific part of a date, such as a year, month, day, or hour:

-- SQL Server
SELECT
    DATEPART(YEAR, GETDATE()) AS Year,
    DATEPART(MONTH, GETDATE()) AS Month,
    DATEPART(DAY, GETDATE()) AS Day;

-- PostgreSQL/Oracle
SELECT
    EXTRACT(YEAR FROM CURRENT_DATE) AS Year,
    EXTRACT(MONTH FROM CURRENT_DATE) AS Month,
    EXTRACT(DAY FROM CURRENT_DATE) AS Day;

DATEADD adds a specified interval to a date. For example, adding 7 days to the current date:

-- SQL Server
SELECT DATEADD(DAY, 7, GETDATE()) AS NewDate;

-- PostgreSQL
SELECT CURRENT_DATE + INTERVAL '7 days' AS NewDate;

-- Oracle
SELECT CURRENT_DATE + 7 AS NewDate;

DATEDIFF calculates the difference between two dates:

-- SQL Server
SELECT DATEDIFF(DAY, '2023-11-01', GETDATE()) AS DaysDifference;

-- PostgreSQL
SELECT EXTRACT(DAY FROM (CURRENT_DATE - DATE '2023-11-01')) AS DaysDifference;

-- Oracle
SELECT TRUNC(CURRENT_DATE - DATE '2023-11-01') AS DaysDifference;

FORMAT is useful for formatting dates:

-- SQL Server
SELECT FORMAT(GETDATE(), 'yyyy-MM-dd HH:mm:ss') AS FormattedDate;

-- PostgreSQL/Oracle
SELECT TO_CHAR(CURRENT_DATE, 'YYYY-MM-DD HH24:MI:SS') AS FormattedDate;

EOMONTH/LAST_DAY is useful for finding the last day of the month:

-- SQL Server
SELECT EOMONTH(GETDATE()) AS LastDayOfMonth;

-- Oracle
SELECT LAST_DAY(CURRENT_DATE) AS LastDayOfMonth;

--- PostgreSQL
SELECT (DATE_TRUNC('month', now()) + interval '1 month - 1 day');

Recommendations for writing complex SQL queries

Keep it simple

Avoid using complex joins, recursive functions, and nested aggregations unless absolutely necessary. Overly complex queries can be difficult to debug and may affect performance.

Build query bottom-up

Start by testing the basic query requirements. Then, introduce complexity step-by-step. This approach ensures that each query stage functions correctly before adding more layers.

Use more readable syntax

Use proper formatting, indentation, and comments to ensure clarity. Where applicable, simplify type conversions and transformations with shorthand syntax.

Adopt a declarative approach

Use a declarative approach to build reusable code blocks and make your SQL logic more modular and understandable. Declarative configurations emphasize specifying what to do rather than how to do it. They tend to be easier to debug, reuse, and optimize.

Unfortunately, SQL statements are only declarative in isolation. Multi-statement SQL scripts dictate the exact order of persistence (e.g., using INSERT and UPDATE in a specific order) and tend to become rigid and monolithic.

DataForge is an open-source, next-generation data transformation tool that improves SQL's declarative approach. It provides more flexibility and declarative power, especially for complex data transformations and processing tasks.

For example, we could define a "products" source using the below code with DataForge:

source_name: "Products"
source_query: "SELECT * FROM Products"
raw_attributes:
- product_id int
- product_name string
- average_cost decimal
- average_sell_price decimal
rules:
 - name: "Round Average Cost"
   expression: "ROUND([This].average_cost, 2)"
 - name: "Truncate Product Name"
   expression: "CASE
     WHEN LENGTH([This].product_name) > 30
     THEN CONCAT(LEFT([This].product_name, 30), '...')
     ELSE [This].product_name
   END"

In this declarative approach, DataForge allows users to define queries in terms of transformations and conditions rather than explicit step-by-step updates. This configuration creates reusable rules (e.g., rounding or truncating) and enforces modular, maintainable transformations.

Conclusion

SQL offers a powerful suite of tools for data manipulation and transformation. By mastering techniques like recursive queries, window functions, CTEs, pivoting, and date/time manipulation, engineers can perform intricate data analysis and report directly within SQL, streamlining processes and enhancing efficiency.