Refactor and Optimize a Complex SQL Query Using Window Functions
Given a complex SQL query with multiple nested subqueries and joins, refactor it to improve readability, performance, and maintainability while preserving the original output.
Challenge prompt
You are provided with a SQL query that calculates rank, running totals, and aggregates from a sales dataset. The original query uses multiple nested subqueries and joins that cause inefficient execution and are hard to maintain. Your task is to refactor the SQL code to use window functions and CTEs (Common Table Expressions) to improve query performance and clarity without changing the output results. Ensure your refactored query produces the same dataset with the same order and columns.
Guidance
- • Use window functions such as ROW_NUMBER(), RANK(), SUM() OVER(), or others to simplify aggregation and ranking logic.
- • Replace nested subqueries with CTE blocks for better modularity and readability.
- • Verify output consistency by comparing results before and after refactoring.
Hints
- • Look for repeated join patterns and subqueries that compute partial aggregates—these can frequently be replaced by window functions over partitions.
- • Common window functions like ROW_NUMBER() and SUM() OVER(PARTITION BY ...) can often eliminate inefficient grouping and self-joins.
- • Ensure ORDER BY clauses inside window functions match the original query’s ordering to preserve ranking logic.
Starter code
WITH sales_data AS (
SELECT
s.customer_id,
s.order_date,
s.amount,
c.region
FROM sales s
JOIN customers c ON s.customer_id = c.id
),
ranked_sales AS (
SELECT
sd.customer_id,
sd.order_date,
sd.amount,
sd.region,
(SELECT COUNT(DISTINCT s2.order_date)
FROM sales_data s2
WHERE s2.customer_id = sd.customer_id
AND s2.order_date <= sd.order_date) AS order_rank
FROM sales_data sd
),
aggregated_totals AS (
SELECT
customer_id,
MAX(order_rank) AS total_orders
FROM ranked_sales
GROUP BY customer_id
),
final_output AS (
SELECT
rs.customer_id,
rs.order_date,
rs.amount,
rs.region,
rs.order_rank,
at.total_orders,
(SELECT SUM(amount) FROM sales_data WHERE customer_id = rs.customer_id AND order_date <= rs.order_date) AS running_total
FROM ranked_sales rs
JOIN aggregated_totals at ON rs.customer_id = at.customer_id
)
SELECT * FROM final_output
ORDER BY customer_id, order_date;Expected output
The result table with columns: customer_id, order_date, amount, region, order_rank (rank of order_date per customer), total_orders (total number of orders per customer), running_total (cumulative amount per customer ordered by date). The rows should be ordered by customer_id then order_date, with the same values as the original nested-subquery output.
Core concepts
Challenge a Friend
Send this duel to someone else and see if they can solve it.