Introduction to SQL
Basic SQL Syntax and Queries
Filtering and Sorting Data
Joining Tables
Aggregating Data
Subqueries and Nested Queries
Modifying Data
Database Constraints and Integrity
Advanced SQL Functions
Database Design and Normalization
This chapter provides the foundational understanding of SQL (Structured Query Language), its role in interacting with databases, and the essential concepts you’ll need to work with SQL effectively.
Definition: SQL is a standardized language used for managing and manipulating relational databases. It allows users to perform various operations like querying, updating, inserting, and deleting data.
Purpose: SQL helps in querying databases to retrieve information, managing database schemas, and performing administrative tasks like securing and maintaining data.
Origins: SQL was developed by IBM in the early 1970s as part of the System R project. It was influenced by relational database theory proposed by Edgar F. Codd.
Standardization: SQL became a standard in 1986 by the American National Standards Institute (ANSI) and later by the International Organization for Standardization (ISO).
Evolution: Over time, SQL has evolved with new versions and implementations across various database management systems (DBMS) like MySQL, PostgreSQL, Oracle, SQL Server, and SQLite.
Relational Databases: SQL is primarily used for interacting with relational databases, which organize data into tables (relations) consisting of rows and columns.
Database Management Systems (DBMS): A DBMS is a software that interacts with databases. Examples include MySQL, PostgreSQL, Microsoft SQL Server, and Oracle Database. SQL is the query language these systems use to manipulate and access the data.
Tables: Data in relational databases is stored in tables, each with a set of rows and columns.
Rows: Each row represents a single record or data entry.
Columns: Each column in a table represents an attribute or a field of the record.
Schemas: A database schema defines the structure of the database, including the tables, relationships, and constraints.
SQL operations are categorized into four major types:
DML (Data Manipulation Language): Commands used to manipulate data within tables. Examples include:
SELECT: Retrieve data from one or more tables.
INSERT: Add new records to a table.
UPDATE: Modify existing records.
DELETE: Remove records from a table.
DDL (Data Definition Language): Commands used to define and manage database structures. Examples include:
CREATE: Create new tables, schemas, or databases.
ALTER: Modify existing database objects.
DROP: Delete database objects like tables or databases.
DCL (Data Control Language): Commands used to manage access control and permissions. Examples include:
GRANT: Give users access to database resources.
REVOKE: Remove access or permissions from users.
TCL (Transaction Control Language): Commands that deal with transactions, ensuring data integrity. Examples include:
COMMIT: Save the changes made during a transaction.
ROLLBACK: Undo changes made during a transaction.
Basic Query Structure: The most common SQL operation is querying data from a table using the SELECT
statement. A basic query looks like this:
SELECT column1, column2 FROM table_name;
Clauses: SQL queries consist of different clauses such as:
SELECT
: Specifies the columns to be returned.
FROM
: Specifies the table to retrieve data from.
WHERE
: Adds conditions to filter the data.
ORDER BY
: Sorts the result.
Example:
SELECT first_name, last_name FROM employees WHERE department = 'HR';
SQL commands are case-insensitive (e.g., select
is the same as SELECT
).
Keywords like SELECT
, FROM
, and WHERE
should be written in uppercase for readability (though lowercase works too).
Commands must end with a semicolon (;
), especially when executing multiple commands.
SQL supports various data types, including:
Numeric Types: INT
, DECIMAL
, FLOAT
, etc.
Text Types: VARCHAR
, CHAR
, TEXT
.
Date and Time Types: DATE
, TIME
, DATETIME
.
Boolean: BOOLEAN
, representing true or false.
Simple SELECT: Retrieve all data from a table.
SELECT * FROM employees;
SELECT with WHERE: Filter data using a condition.
SELECT * FROM employees WHERE department = 'Sales';
SELECT with ORDER BY: Sort data by a column.
SELECT * FROM employees ORDER BY last_name ASC;
Real-World Use Cases: SQL is widely used in business applications, web development, data analysis, and reporting. Understanding SQL allows you to extract meaningful insights from data, manage databases, and ensure data integrity.
SQL and Reporting: SQL is essential in generating reports, such as sales data, inventory management, and customer insights, often used for decision-making in businesses.
Chapter 1 introduces the core concepts of SQL, including its role in interacting with relational databases, the basics of SQL syntax, and the key operations you can perform using SQL. Understanding these basics is crucial before diving deeper into more advanced topics like joins, subqueries, and database design.
This chapter introduces the foundational elements of SQL syntax, focusing on the basic structure of SQL queries and the most common operations used to interact with relational databases.
SQL statements have a standard structure, with commands, clauses, and expressions. While SQL syntax is not case-sensitive, it’s a common practice to write keywords in uppercase for clarity.
SQL queries are composed of one or more of the following components:
Clauses: Parts of the SQL query that define the operation.
Expressions: Conditions or formulas used to define the data.
Predicates: Conditions that filter rows, such as comparisons or logical operators.
Example of basic syntax structure:
SELECT column1, column2
FROM table_name
WHERE condition
ORDER BY column;
Purpose: The SELECT
statement is used to retrieve data from one or more tables in a database.
Basic Syntax: To select all data from a table:
SELECT * FROM table_name;
*
is a wildcard that selects all columns from the specified table.Example:
SELECT first_name, last_name FROM employees;
Selecting Specific Columns: Instead of *
, you can specify which columns you want to retrieve.
SELECT first_name, salary FROM employees;
The FROM
clause specifies the table from which to retrieve data.
Example: To query from the “employees” table:
SELECT * FROM employees;
Purpose: The WHERE
clause filters the rows based on specific conditions. It allows you to retrieve only the data that meets certain criteria.
Basic Syntax: The WHERE
clause is used to specify the condition for selecting data:
SELECT column1, column2 FROM table_name WHERE condition;
Common conditions:
Comparison Operators: =
, >
, <
, >=
, <=
, <>
(not equal).
SELECT * FROM employees WHERE salary > 50000;
Logical Operators: AND
, OR
, NOT
are used to combine conditions.
SELECT * FROM employees WHERE salary > 50000 AND department = 'HR';
BETWEEN: Used to filter data within a range.
SELECT * FROM employees WHERE salary BETWEEN 40000 AND 60000;
IN: Allows you to specify multiple values in the WHERE
clause.
SELECT * FROM employees WHERE department IN ('HR', 'Sales');
LIKE: Used for pattern matching with wildcards (%
for any sequence of characters, _
for a single character).
SELECT * FROM employees WHERE last_name LIKE 'S%';
IS NULL: Checks for NULL values.
SELECT * FROM employees WHERE middle_name IS NULL;
Purpose: The ORDER BY
clause sorts the results of a query in either ascending (default) or descending order.
Basic Syntax:
SELECT column1, column2 FROM table_name ORDER BY column1 [ASC|DESC];
Sorting Order:
ASC
: Ascending order (default).
DESC
: Descending order.
Example: Sorting employees by salary in descending order:
SELECT first_name, last_name, salary FROM employees ORDER BY salary DESC;
Purpose: The LIMIT
clause restricts the number of rows returned by a query.
Syntax:
SELECT column1, column2 FROM table_name LIMIT number_of_rows;
Example: Get only the top 5 employees with the highest salaries:
SELECT first_name, last_name, salary FROM employees ORDER BY salary DESC LIMIT 5;
Purpose: The DISTINCT
keyword removes duplicate rows from the result set.
Syntax:
SELECT DISTINCT column_name FROM table_name;
Example: Retrieve unique job titles from the employees table:
SELECT DISTINCT job_title FROM employees;
Purpose: The UNION
operator is used to combine the results of two or more SELECT
queries. All queries combined using UNION
must have the same number of columns with compatible data types.
Syntax:
SELECT column1, column2 FROM table1
UNION
SELECT column1, column2 FROM table2;
Example: Combine employee names from two different tables (e.g., full-time and part-time employees):
SELECT first_name, last_name FROM full_time_employees
UNION
SELECT first_name, last_name FROM part_time_employees;
Purpose: Aliases provide a temporary name for tables or columns in a query, making the result set easier to understand or more readable.
Column Alias:
SELECT first_name AS 'Employee First Name', last_name AS 'Employee Last Name' FROM employees;
Table Alias: Alias for tables is useful when joining multiple tables.
SELECT e.first_name, e.last_name, d.department_name
FROM employees AS e
JOIN departments AS d ON e.department_id = d.department_id;
Syntax Errors: If you write an incorrect SQL statement (such as a misspelled keyword or missing clause), the database system will return an error.
Logical Errors: Logical errors, such as incorrect filtering or sorting, might not trigger errors but can return unexpected results. It’s important to carefully check query logic when results don’t match expectations.
Debugging Tips:
Check the SQL syntax.
Verify column and table names are correct.
Use comments (--
for single-line and /* */
for multi-line comments) to isolate parts of your query for testing.
Chapter 2 introduces you to the basic SQL syntax for querying data. You learn how to construct simple queries using SELECT
, FROM
, WHERE
, ORDER BY
, and LIMIT
clauses. You also learn how to manipulate results with DISTINCT
, use logical operators, and combine queries with UNION
. This foundational knowledge is essential for building more advanced queries and interacting effectively with databases.
This chapter delves deeper into how to filter and sort data in SQL queries, which is essential for retrieving specific information from large datasets. You’ll learn how to apply various conditions to select only the relevant data, and how to order the results in a meaningful way.
The WHERE
clause is the most common way to filter data. It helps you define conditions that rows must meet to be included in the result set. The condition in the WHERE
clause can be a comparison, a logical operator, or more complex expressions.
Basic Syntax:
SELECT column1, column2 FROM table_name WHERE condition;
Conditions:
Comparison Operators: These are used to compare values.
=
: Equal to
>
: Greater than
<
: Less than
>=
: Greater than or equal to
<=
: Less than or equal to
<>
or !=
: Not equal to
Example: Get employees whose salary is greater than 50,000.
SELECT first_name, last_name, salary FROM employees WHERE salary > 50000;
Logical Operators: These allow combining multiple conditions.
AND
: Ensures all conditions are true.
OR
: Ensures at least one condition is true.
NOT
: Reverses the condition (e.g., NOT NULL
, NOT IN
).
Example: Get employees with a salary above 50,000 and who work in the ‘Sales’ department.
SELECT first_name, last_name, salary
FROM employees
WHERE salary > 50000 AND department = 'Sales';
BETWEEN: Used to filter data within a specific range. It is inclusive of the boundary values.
SELECT * FROM employees WHERE salary BETWEEN 40000 AND 60000;
IN: Specifies a list of possible values. It’s an easier way to test if a value exists in a given list of values.
SELECT * FROM employees WHERE department IN ('HR', 'Sales', 'IT');
LIKE: Used for pattern matching, typically with wildcards. It’s useful for partial matches.
%
: Matches zero or more characters.
_
: Matches a single character.
SELECT * FROM employees WHERE last_name LIKE 'S%'; -- Last names starting with S
IS NULL / IS NOT NULL: Used to check for NULL values.
SELECT * FROM employees WHERE middle_name IS NULL;
In this section, you’ll learn more complex filtering methods to refine data retrieval.
Combining Conditions with Parentheses: Parentheses help organize complex conditions and ensure correct evaluation order.
SELECT first_name, last_name, salary
FROM employees
WHERE (salary > 50000 OR department = 'HR') AND hire_date > '2020-01-01';
Using EXISTS
: The EXISTS
operator is used in correlated subqueries to check if the subquery returns any rows.
SELECT first_name, last_name
FROM employees e
WHERE EXISTS (SELECT * FROM departments d WHERE e.department_id = d.department_id);
Using NULL
with Comparison Operators: Remember, comparing with NULL
directly using =
or <>
doesn’t work. Instead, use IS NULL
or IS NOT NULL
.
SELECT * FROM employees WHERE commission_pct IS NOT NULL;
Sorting is an essential part of data retrieval, especially when you’re working with large datasets. The ORDER BY
clause helps to organize the result set based on one or more columns.
Basic Syntax:
SELECT column1, column2 FROM table_name ORDER BY column1 [ASC|DESC];
ASC (Ascending Order): This is the default. It sorts the result from the smallest to largest (or alphabetically from A to Z).
DESC (Descending Order): Sorts the result from largest to smallest (or alphabetically from Z to A).
Examples:
Sort employees by their last name in ascending order:
SELECT first_name, last_name FROM employees ORDER BY last_name ASC;
Sort employees by salary in descending order:
SELECT first_name, last_name, salary FROM employees ORDER BY salary DESC;
Sort by multiple columns:
SELECT first_name, last_name, department, salary
FROM employees
ORDER BY department ASC, salary DESC;
In this case, employees are first sorted by department in ascending order. Within each department, employees are sorted by salary in descending order.
When dealing with large datasets, it’s often useful to limit the number of rows returned by a query. This is where the LIMIT
clause (or TOP
in some databases like SQL Server) comes in handy.
Basic Syntax for LIMIT:
SELECT column1, column2 FROM table_name LIMIT number_of_rows;
Example: Retrieve the top 5 employees with the highest salaries.
SELECT first_name, last_name, salary FROM employees ORDER BY salary DESC LIMIT 5;
OFFSET and LIMIT: To implement paging, you can use LIMIT
with OFFSET
. The OFFSET
specifies how many rows to skip.
SELECT first_name, last_name FROM employees ORDER BY last_name ASC LIMIT 10 OFFSET 10;
Combining GROUP BY
with HAVING
: When dealing with aggregate functions (e.g., COUNT
, AVG
, SUM
), you often need to group data by certain columns. The GROUP BY
clause is used to group the result set by one or more columns, and the HAVING
clause is used to filter groups based on a condition, similar to how the WHERE
clause works for individual rows.
Example: Find departments with an average salary greater than 50,000.
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000;
Window functions allow you to perform calculations across a set of table rows that are somehow related to the current row. This can be helpful for sorting and filtering based on complex conditions without collapsing the rows into aggregated results.
ROW_NUMBER(): Assigns a unique number to each row within a partition of a result set.
SELECT first_name, last_name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
FROM employees;
This query assigns a ranking number based on salary in descending order.
SQL allows you to combine values and functions to filter and sort data in more advanced ways.
if
or switch
statement in programming, a CASE
expression allows you to return a value based on conditions.Example: Classify employees based on their salary ranges.
SELECT first_name, last_name, salary,
CASE
WHEN salary < 30000 THEN 'Low'
WHEN salary BETWEEN 30000 AND 70000 THEN 'Medium'
ELSE 'High'
END AS salary_range
FROM employees;
Chapter 3 expands your knowledge of filtering and sorting data in SQL. You now understand how to use various comparison and logical operators in the WHERE
clause, how to organize results with the ORDER BY
clause, and how to limit or page through results. You also learned how to filter data based on aggregate functions using GROUP BY
and HAVING
, and how window functions can provide additional power for sorting and filtering complex datasets. These skills are essential for narrowing down and organizing data in a meaningful way.
This chapter explores SQL joins, which allow you to combine data from two or more tables based on a related column. Joins are crucial when working with normalized databases, where data is split across multiple tables to reduce redundancy. Understanding how to use joins will help you extract useful insights from complex datasets.
A join in SQL is a way to combine rows from two or more tables based on a related column between them. Each table contains different pieces of related data, and joining them together creates a more complete dataset. The most common reason for using joins is to retrieve data from multiple tables simultaneously.
There are several types of joins:
Inner Join
Left Join (or Left Outer Join)
Right Join (or Right Outer Join)
Full Outer Join
Cross Join
Self Join
Definition: The INNER JOIN
returns only the rows where there is a match in both tables. If there is no match, those rows are excluded from the result set.
Basic Syntax:
SELECT columns
FROM table1
INNER JOIN table2
ON table1.column = table2.column;
Example: Retrieve employees with their department names from the employees
and departments
tables.
SELECT employees.first_name, employees.last_name, departments.department_name
FROM employees
INNER JOIN departments ON employees.department_id = departments.department_id;
In this example, only employees who belong to a department will be returned. If an employee doesn’t belong to any department, they are excluded.
Definition: The LEFT JOIN
(also known as LEFT OUTER JOIN
) returns all rows from the left table (table1), and the matched rows from the right table (table2). If there is no match, the result is NULL
on the side of the right table.
Basic Syntax:
SELECT columns
FROM table1
LEFT JOIN table2
ON table1.column = table2.column;
Example: Retrieve all employees, even if they don’t belong to a department.
SELECT employees.first_name, employees.last_name, departments.department_name
FROM employees
LEFT JOIN departments ON employees.department_id = departments.department_id;
This query returns all employees, and for those without a department, the department_name
will be NULL
.
Definition: The RIGHT JOIN
(also known as RIGHT OUTER JOIN
) is the opposite of the LEFT JOIN
. It returns all rows from the right table (table2) and the matched rows from the left table (table1). If there is no match, the result is NULL
on the left side.
Basic Syntax:
SELECT columns
FROM table1
RIGHT JOIN table2
ON table1.column = table2.column;
Example: Retrieve all departments, even if there are no employees in those departments.
SELECT employees.first_name, employees.last_name, departments.department_name
FROM employees
RIGHT JOIN departments ON employees.department_id = departments.department_id;
This query returns all departments, and for departments with no employees, the first_name
and last_name
will be NULL
.
Definition: The FULL OUTER JOIN
returns all rows when there is a match in either the left or the right table. If there is no match, NULL
is returned for missing values on either side.
Basic Syntax:
SELECT columns
FROM table1
FULL OUTER JOIN table2
ON table1.column = table2.column;
Example: Retrieve all employees and all departments, even if there’s no match between the two.
SELECT employees.first_name, employees.last_name, departments.department_name
FROM employees
FULL OUTER JOIN departments ON employees.department_id = departments.department_id;
This query returns all employees and all departments. For employees without a department or departments without employees, the non-matching fields will be NULL
.
Definition: The CROSS JOIN
returns the Cartesian product of two tables. This means it will return all possible combinations of rows from the first table with all rows from the second table. Be careful when using this join, as it can produce a large number of results if the tables involved have many rows.
Basic Syntax:
SELECT columns
FROM table1
CROSS JOIN table2;
Example: Get all possible pairs of employees and departments.
SELECT employees.first_name, departments.department_name
FROM employees
CROSS JOIN departments;
If there are 10 employees and 5 departments, this query will return 50 rows (10 x 5 = 50 combinations).
Definition: A self join is a regular join, but the table is joined with itself. You can use a self join to compare rows within the same table.
Basic Syntax:
SELECT a.column1, b.column2
FROM table a
JOIN table b
ON a.column = b.column;
Example: Retrieve employees who report to other employees (assuming we have a manager_id
column).
SELECT e1.first_name AS Employee, e2.first_name AS Manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.employee_id;
In this query, e1
represents employees, and e2
represents their managers. The LEFT JOIN
ensures we still get employees even if they don’t have a manager (i.e., their manager_id
is NULL
).
Multiple Joins: You can join more than two tables in a single query by chaining joins.
Example: Retrieve employee names, their department names, and their manager names.
SELECT e.first_name AS Employee, d.department_name, m.first_name AS Manager
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
LEFT JOIN employees m ON e.manager_id = m.employee_id;
This query joins the employees
table twice—once for the employees themselves and once for their managers. The LEFT JOIN
ensures employees with no managers are included.
Join Conditions: You can join tables based on any condition, not just equality. For example, you can use inequality conditions (>
, <
, >=
, etc.) or combinations of multiple conditions.
Example: Retrieve all employees who started after their manager.
SELECT e.first_name, e.hire_date, m.first_name AS Manager, m.hire_date AS Manager_Hire_Date
FROM employees e
INNER JOIN employees m ON e.manager_id = m.employee_id
WHERE e.hire_date > m.hire_date;
Table Aliases: Using aliases for tables (like e
for employees) makes your SQL queries shorter and easier to read.
Column Aliases: Similarly, using aliases for columns (e.g., e.first_name AS Employee_Name
) improves readability.
Example:
SELECT e.first_name AS Employee_Name, d.department_name AS Department, m.first_name AS Manager_Name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
LEFT JOIN employees m ON e.manager_id = m.employee_id;
Indexes: To improve the performance of joins, ensure the columns used in the ON
condition are indexed. For example, indexing employee_id
in both tables will speed up the INNER JOIN
.
Reducing Join Size: Be mindful of the number of rows involved in the join. For large tables, a poorly designed join can significantly impact performance. Try to minimize unnecessary rows by using appropriate filters (WHERE
clause) before performing joins.
Optimizing with EXPLAIN
: Use the EXPLAIN
keyword to analyze the execution plan of a query and optimize joins.
Chapter 4 provides an in-depth exploration of SQL joins. You’ve learned how to use different types of joins—INNER JOIN
, LEFT JOIN
, RIGHT JOIN
, FULL OUTER JOIN
, CROSS JOIN
, and SELF JOIN
—to retrieve data from multiple tables. You’ve also learned how to combine tables, apply conditions, use aliases for readability, and optimize queries. Joins are a critical part of querying relational databases and are essential for retrieving meaningful insights from connected data.
In this chapter, we’ll focus on aggregating data, which is a critical skill for summarizing, analyzing, and extracting useful insights from large datasets. SQL provides powerful functions to perform operations such as counting, summing, averaging, and more. You’ll learn how to use these functions in combination with grouping and filtering to aggregate data effectively.
Definition: Aggregation refers to the process of summarizing data by performing calculations across multiple rows. This is typically done using aggregate functions like COUNT()
, SUM()
, AVG()
, MIN()
, and MAX()
.
Aggregating data is often useful when you want to get a high-level overview of your data, such as the total sales, average salary, or maximum temperature in a city over time.
SQL provides several built-in aggregate functions that operate on a set of values and return a single result.
COUNT(): Returns the number of rows that match a given condition or the total number of rows in a table.
Syntax:
SELECT COUNT(column_name) FROM table_name;
Example: Get the number of employees in the employees
table.
SELECT COUNT(*) FROM employees;
Example with a Condition: Get the number of employees with a salary above 50,000.
SELECT COUNT(*) FROM employees WHERE salary > 50000;
SUM(): Adds up the values of a specified column.
Syntax:
SELECT SUM(column_name) FROM table_name;
Example: Get the total salary of all employees.
SELECT SUM(salary) FROM employees;
AVG(): Calculates the average value of a specified column.
Syntax:
SELECT AVG(column_name) FROM table_name;
Example: Get the average salary of employees.
SELECT AVG(salary) FROM employees;
MIN(): Returns the smallest value in a specified column.
Syntax:
SELECT MIN(column_name) FROM table_name;
Example: Get the minimum salary of employees.
SELECT MIN(salary) FROM employees;
MAX(): Returns the largest value in a specified column.
Syntax:
SELECT MAX(column_name) FROM table_name;
Example: Get the maximum salary of employees.
SELECT MAX(salary) FROM employees;
Definition: The GROUP BY
clause is used in combination with aggregate functions to group rows that share a property into summary rows, like “total sales per region” or “average salary by department”.
Basic Syntax:
SELECT column1, AGGREGATE_FUNCTION(column2)
FROM table_name
GROUP BY column1;
Example: Get the total salary for each department.
SELECT department, SUM(salary)
FROM employees
GROUP BY department;
Explanation: In this example, the employees
table is grouped by the department
column, and for each group (department), we calculate the sum of the salary
column. This query returns the total salary per department.
Definition: The HAVING
clause is used to filter results after aggregation, similar to how the WHERE
clause is used to filter rows before aggregation. The key difference is that HAVING
works with groups created by the GROUP BY
clause, while WHERE
filters individual rows.
Basic Syntax:
SELECT column1, AGGREGATE_FUNCTION(column2)
FROM table_name
GROUP BY column1
HAVING condition;
Example: Get departments with a total salary greater than 200,000.
SELECT department, SUM(salary)
FROM employees
GROUP BY department
HAVING SUM(salary) > 200000;
Explanation: In this query, we first group employees by department and calculate the total salary per department using SUM(salary)
. Then, we filter the departments to only include those with a total salary greater than 200,000 using the HAVING
clause.
Example: Get the total salary, average salary, and the number of employees in each department.
SELECT department,
SUM(salary) AS total_salary,
AVG(salary) AS avg_salary,
COUNT(*) AS num_employees
FROM employees
GROUP BY department;
salary
column, but they are grouped by the department
column.DISTINCT
keyword within aggregate functions to perform the calculation on only unique (distinct) values.Example: Get the number of unique job titles in the employees
table.
SELECT COUNT(DISTINCT job_title) FROM employees;
employees
table. If there are multiple employees with the same job title, only one instance of that title is counted.JOIN
to summarize data from related tables.Example: Get the total salary and the number of employees in each department, joining the employees
table with the departments
table.
SELECT d.department_name,
COUNT(e.employee_id) AS num_employees,
SUM(e.salary) AS total_salary
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name;
employees
table with the departments
table on department_id
, then groups the result by department name.Window functions allow you to perform aggregations across a set of rows that are related to the current row within the result set, without collapsing the rows into a single result.
ROW_NUMBER(): Assigns a unique sequential integer to rows within a partition.
RANK(): Assigns a rank to rows within a partition, with gaps in the ranking for ties.
SUM() OVER(): Performs a sum calculation over a specific window of rows.
Example: Get the running total of salaries for each employee, ordered by their salary.
SELECT first_name, last_name, salary,
SUM(salary) OVER (ORDER BY salary) AS running_total
FROM employees;
SUM(salary) OVER (ORDER BY salary)
calculates a cumulative sum as it processes each row, ordered by salary.Indexes: Aggregating data over large datasets can be slow if the columns used in the aggregation or the GROUP BY
clause are not indexed. Indexing these columns can significantly speed up the query.
Memory Usage: Aggregation operations, especially on large datasets, can consume a lot of memory. Use LIMIT
and OFFSET
when working with very large datasets to reduce the result size.
Optimization: Consider using optimized database engines or breaking down complex aggregations into smaller steps if performance becomes a concern.
Chapter 5 covers the essentials of aggregating data in SQL using functions like COUNT()
, SUM()
, AVG()
, MIN()
, and MAX()
. You’ve learned how to group data with the GROUP BY
clause, filter grouped data with the HAVING
clause, and use multiple aggregate functions in a single query. Additionally, we covered using DISTINCT
with aggregates, performing aggregations with JOIN
, and leveraging window functions for more complex aggregation tasks. Aggregating data is a powerful tool for summarizing large datasets and making data-driven decisions.
In this chapter, we will explore subqueries and nested queries, two powerful tools in SQL that allow you to perform more complex operations by embedding queries within other queries. Subqueries can be used to filter, calculate, or transform data in ways that would be difficult with a single query.
A subquery (or inner query) is a query embedded inside another query. It allows you to perform intermediate steps in a query and use their results as part of the outer query.
Subqueries can be used in various places within a SQL statement:
In the WHERE clause to filter rows based on the result of another query.
In the SELECT clause to compute a value based on another query.
In the FROM clause to define a derived table.
Example of a subquery:
SELECT first_name, last_name, salary
FROM employees
WHERE department_id = (SELECT department_id FROM departments WHERE department_name = 'Sales');
(SELECT department_id FROM departments WHERE department_name = 'Sales')
returns the department_id
of the ‘Sales’ department, and the outer query uses that value to retrieve employees from that department.There are two main types of subqueries:
Single-row Subqueries: Return only one row and one column.
Multi-row Subqueries: Return more than one row and potentially multiple columns.
Example: Retrieve employees who earn more than the average salary.
SELECT first_name, last_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
(SELECT AVG(salary) FROM employees)
calculates the average salary, and the outer query retrieves employees with salaries greater than that average.Example: Retrieve employees who work in departments with more than 10 employees.
SELECT first_name, last_name, department_id
FROM employees
WHERE department_id IN (SELECT department_id FROM employees GROUP BY department_id HAVING COUNT(*) > 10);
(SELECT department_id FROM employees GROUP BY department_id HAVING COUNT(*) > 10)
returns a list of department IDs with more than 10 employees, and the outer query retrieves employees who belong to those departments.Example: Get employees who earn more than the average salary in their own department.
SELECT first_name, last_name, salary, department_id
FROM employees e
WHERE salary > (SELECT AVG(salary) FROM employees WHERE department_id = e.department_id);
(SELECT AVG(salary) FROM employees WHERE department_id = e.department_id)
calculates the average salary for each department dynamically as the outer query processes each row.Example: Get the department-wise average salary, but only for departments with more than 5 employees.
SELECT department_id, AVG(salary) AS avg_salary
FROM (SELECT department_id, salary FROM employees WHERE department_id IN (SELECT department_id FROM employees GROUP BY department_id HAVING COUNT(*) > 5)) AS dept_salaries
GROUP BY department_id;
(SELECT department_id, salary FROM employees WHERE department_id IN (SELECT department_id FROM employees GROUP BY department_id HAVING COUNT(*) > 5))
is used to create a derived table (dept_salaries
) that contains only the relevant rows. The outer query then calculates the average salary for each department.Subqueries are commonly used with aggregate functions like COUNT()
, SUM()
, AVG()
, MIN()
, and MAX()
to calculate values that are then used in the outer query.
Example: Get employees whose salary is greater than the average salary of all employees.
SELECT first_name, last_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
(SELECT AVG(salary) FROM employees)
returns the average salary for all employees, and the outer query retrieves employees who earn more than this average.You can use subqueries in the SELECT
clause to compute a value for each row returned by the outer query.
Example: Get the employee’s name and their department’s average salary.
SELECT first_name, last_name,
(SELECT AVG(salary) FROM employees WHERE department_id = e.department_id) AS avg_salary
FROM employees e;
EXISTS
operator is used to check if a subquery returns any rows. It returns TRUE
if the subquery returns at least one row, and FALSE
otherwise.Example: Get employees who belong to a department that has employees with salaries greater than 100,000.
SELECT first_name, last_name
FROM employees e
WHERE EXISTS (SELECT 1 FROM employees WHERE department_id = e.department_id AND salary > 100000);
e.department_id
) with a salary greater than 100,000. The outer query returns those employees for whom the subquery condition is met.IN: The IN
operator is used to match a value against a list of values returned by a subquery.
SELECT first_name, last_name
FROM employees
WHERE department_id IN (SELECT department_id FROM departments WHERE location = 'New York');
ANY: The ANY
operator compares a value to a set of values returned by a subquery. It returns TRUE
if the condition is TRUE
for any row in the subquery result.
SELECT first_name, last_name
FROM employees
WHERE salary > ANY (SELECT salary FROM employees WHERE department_id = 3);
ALL: The ALL
operator compares a value to all values returned by a subquery. It returns TRUE
only if the condition is TRUE
for all rows in the subquery result.
SELECT first_name, last_name
FROM employees
WHERE salary > ALL (SELECT salary FROM employees WHERE department_id = 3);
Efficiency: Subqueries can be resource-intensive, especially if used with large datasets. Subqueries in the WHERE
clause are executed for each row in the outer query (in the case of correlated subqueries), which can slow down performance.
Optimization: Sometimes, subqueries can be replaced by joins, which might be more efficient. For example, instead of using a subquery to find the maximum salary in the WHERE
clause, a JOIN
or GROUP BY
might be more efficient.
EXPLAIN: Use the EXPLAIN
command to understand the execution plan of your query and optimize subqueries accordingly.
Keep it Simple: Try to keep subqueries as simple as possible. Complex, nested subqueries can be difficult to read and maintain.
Avoid Unnecessary Subqueries: Whenever possible, try to use joins instead of subqueries. Joins are often more efficient and easier to understand.
Use Subqueries with Care: When you need to perform intermediate calculations or comparisons, subqueries are invaluable. But always test performance, especially on large datasets.
Chapter 6 covers the concept of subqueries and nested queries in SQL, including the different types of subqueries (single-row, multi-row, correlated), and where to use them in various parts of a SQL query (WHERE, SELECT, FROM). We explored common subquery operations like using aggregate functions, checking conditions with EXISTS
, and utilizing IN
, ANY
, and ALL
operators. Additionally, we touched on performance considerations and best practices for writing efficient subqueries. Mastering subqueries is essential for solving complex problems in SQL and getting the most out of your data.
This chapter focuses on modifying the data in a relational database using SQL commands. While querying data is essential for retrieving information, modifying data is crucial for managing and maintaining databases. In SQL, we can modify data using the INSERT
, UPDATE
, and DELETE
commands.
The INSERT
statement is used to add new rows to a table. You can either insert specific column values or insert multiple rows at once.
Basic Syntax (Inserting One Row):
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
Example: Insert a new employee into the employees
table.
INSERT INTO employees (first_name, last_name, salary, department_id)
VALUES ('John', 'Doe', 55000, 3);
employees
table, with values provided for the first_name
, last_name
, salary
, and department_id
columns.Inserting Data Without Specifying Column Names:
INSERT INTO employees
VALUES ('Jane', 'Smith', 60000, 2);
Inserting Multiple Rows:
You can insert multiple rows at once by using multiple sets of values.
INSERT INTO employees (first_name, last_name, salary, department_id)
VALUES
('Mark', 'Johnson', 45000, 2),
('Sara', 'Williams', 48000, 3),
('Paul', 'Brown', 52000, 1);
The UPDATE
statement is used to modify the existing records in a table. You can update one or more rows by specifying conditions in the WHERE
clause.
Basic Syntax:
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
Example: Update the salary of an employee.
UPDATE employees
SET salary = 60000
WHERE employee_id = 5;
salary
for the employee with employee_id = 5
to 60,000.Updating Multiple Columns:
You can update multiple columns in a single UPDATE
statement.
UPDATE employees
SET salary = 65000, department_id = 4
WHERE employee_id = 5;
salary
and department_id
columns are updated for the employee with employee_id = 5
.Important Considerations:
WHERE
clause to specify which rows to update. If you omit the WHERE
clause, all rows in the table will be updated.UPDATE employees
SET salary = 50000;
Using Subqueries in UPDATE: You can also use subqueries in the SET
clause to dynamically update values based on other tables.
UPDATE employees
SET salary = (SELECT AVG(salary) FROM employees WHERE department_id = employees.department_id)
WHERE department_id = 3;
The DELETE
statement is used to remove one or more rows from a table based on a specified condition.
Basic Syntax:
DELETE FROM table_name
WHERE condition;
Example: Delete an employee based on employee_id
.
DELETE FROM employees
WHERE employee_id = 10;
employee_id = 10
.Deleting Multiple Rows: You can delete multiple rows by specifying a condition that matches several rows.
DELETE FROM employees
WHERE department_id = 3;
Caution: Like the UPDATE
statement, the DELETE
statement should always include a WHERE
clause. If you omit it, all rows in the table will be deleted, which is a dangerous operation.
DELETE FROM employees;
employees
table. Always double-check the conditions before running DELETE
statements.Using Subqueries with DELETE: You can also use subqueries with DELETE
to remove rows based on a condition from another table.
DELETE FROM employees
WHERE department_id IN (SELECT department_id FROM departments WHERE department_name = 'Sales');
Definition: The TRUNCATE
statement removes all rows from a table, but unlike DELETE
, it does not log individual row deletions and cannot be rolled back in some database systems (such as MySQL).
Syntax:
TRUNCATE TABLE table_name;
Example: Remove all rows from the employees
table.
TRUNCATE TABLE employees;
Difference Between DELETE
and TRUNCATE
:
DELETE
: Can delete specific rows using a WHERE
clause. It is a slower operation as it logs each row deletion and can be rolled back if part of a transaction.
TRUNCATE
: Removes all rows from the table, does not log each row deletion, and cannot be rolled back (depending on the DBMS).
A transaction in SQL allows you to execute multiple operations as a single unit. Transactions are particularly useful for ensuring data consistency and integrity.
Commands:
BEGIN TRANSACTION
: Starts a new transaction.
COMMIT
: Saves all changes made in the current transaction.
ROLLBACK
: Reverts all changes made in the current transaction.
Example: Insert a new employee and update the department in a single transaction.
BEGIN TRANSACTION;
INSERT INTO employees (first_name, last_name, salary, department_id)
VALUES ('Lisa', 'Taylor', 55000, 3);
UPDATE employees
SET department_id = 4
WHERE first_name = 'Lisa' AND last_name = 'Taylor';
COMMIT;
COMMIT
, the changes can be undone by using ROLLBACK
.SQL allows you to define constraints to ensure the integrity of your data. These constraints can impact how data modification operations behave:
PRIMARY KEY: Ensures that each row in a table is unique and not NULL.
FOREIGN KEY: Ensures referential integrity by ensuring that a value in one table exists in another table.
CHECK: Ensures that the value of a column meets certain conditions (e.g., a salary cannot be negative).
NOT NULL: Ensures that a column cannot contain NULL values.
UNIQUE: Ensures that all values in a column are unique.
Example: Attempt to insert a row with a duplicate primary key will fail if the primary key constraint is violated.
INSERT INTO employees (employee_id, first_name, last_name)
VALUES (1, 'Alice', 'Johnson'); -- Assuming employee_id is a primary key and 1 already exists
employee_id
is the primary key and already contains the value 1
, this operation will fail.Bulk Inserts: If you need to insert many rows, it’s more efficient to insert them in bulk rather than inserting one row at a time.
INSERT INTO employees (first_name, last_name, salary)
VALUES
('John', 'Doe', 50000),
('Jane', 'Smith', 55000),
('Mark', 'Johnson', 60000);
Bulk Updates: Similarly, bulk updates should be done in a single operation to avoid performance overhead.
UPDATE employees
SET salary = salary * 1.1
WHERE department_id = 2;
Using Indexes: For performance reasons, it’s advisable to index columns that are frequently updated or used in conditions. This helps speed up modifications like UPDATE
, DELETE
, or INSERT
queries.
Chapter 7 covers the core SQL statements used to modify data in a relational database. You’ve learned how to use the INSERT
, UPDATE
, and DELETE
statements to add, modify, and remove data, respectively. You also explored advanced techniques such as using TRUNCATE
, working with transactions, and ensuring data integrity through constraints. Additionally, the chapter touches on optimization strategies for bulk operations and the importance of using indexes to improve performance when modifying large datasets. These skills are critical for maintaining and managing data in a live production environment.
In this chapter, we’ll focus on database constraints and data integrity, which are fundamental concepts in relational database management. Constraints ensure that the data in the database remains consistent, valid, and reliable by enforcing rules on the data being inserted, updated, or deleted. Understanding and using constraints properly is crucial for maintaining high-quality data in a database.
Definition: Constraints are rules applied to columns or tables that ensure the accuracy and reliability of the data. They help maintain the integrity of the database by preventing invalid data entries and ensuring relationships between tables are properly maintained.
Types of Constraints:
NOT NULL: Ensures that a column cannot have NULL values.
UNIQUE: Ensures that all values in a column are distinct.
PRIMARY KEY: Ensures that each row in a table is unique and not NULL.
FOREIGN KEY: Enforces a relationship between two tables by ensuring that the value in one table matches a value in another table.
CHECK: Ensures that values in a column meet a specific condition.
DEFAULT: Assigns a default value to a column when no value is specified.
INDEX: Not a constraint per se, but improves the performance of queries by allowing quicker search and retrieval.
Definition: The NOT NULL
constraint ensures that a column cannot have a NULL value. This is useful when you want to ensure that a particular column always contains data.
Syntax:
CREATE TABLE table_name (
column_name data_type NOT NULL
);
Example: Create a employees
table where first_name
and last_name
cannot be NULL.
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
salary DECIMAL(10, 2)
);
Use Case: If you want to ensure that every employee has a first and last name, the NOT NULL
constraint prevents inserting any row where these columns are left empty.
Definition: The UNIQUE
constraint ensures that all values in a column are unique. Unlike the PRIMARY KEY
constraint, the UNIQUE
constraint allows NULL values (unless combined with NOT NULL
).
Syntax:
CREATE TABLE table_name (
column_name data_type UNIQUE
);
Example: Ensure that email
values in the employees
table are unique.
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100) UNIQUE
);
Use Case: If you have an email
column in the employees
table, you would use the UNIQUE
constraint to make sure no two employees can have the same email address.
Definition: The PRIMARY KEY
constraint ensures that each row in a table has a unique identifier. It combines the properties of the NOT NULL
and UNIQUE
constraints. A table can only have one primary key, but the primary key can consist of one or more columns (composite primary key).
Syntax:
CREATE TABLE table_name (
column1 data_type PRIMARY KEY,
column2 data_type
);
Example: Create an employees
table where employee_id
is the primary key.
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
salary DECIMAL(10, 2)
);
Composite Primary Key: You can combine multiple columns to form a composite primary key.
CREATE TABLE orders (
order_id INT,
product_id INT,
PRIMARY KEY (order_id, product_id)
);
Use Case: The employee_id
column in the employees
table is used as the primary key, ensuring that each employee has a unique ID.
Definition: The FOREIGN KEY
constraint ensures that a value in one table matches a value in another table, creating a relationship between the two tables. This is a way of enforcing referential integrity between tables.
The FOREIGN KEY
constraint ensures that you cannot insert or update a row in the child table unless the value exists in the parent table.
Syntax:
CREATE TABLE table_name (
column_name data_type,
FOREIGN KEY (column_name) REFERENCES parent_table (parent_column)
);
Example: Create an employees
table where department_id
is a foreign key referencing the departments
table.
CREATE TABLE departments (
department_id INT PRIMARY KEY,
department_name VARCHAR(50)
);
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
department_id INT,
FOREIGN KEY (department_id) REFERENCES departments (department_id)
);
Use Case: The department_id
in the employees
table references the department_id
in the departments
table, ensuring that employees are only assigned to valid departments.
On Delete and On Update Actions: You can define actions that happen when a referenced row is deleted or updated.
ON DELETE CASCADE
: Deletes rows in the child table if the referenced row in the parent table is deleted.
ON UPDATE CASCADE
: Updates rows in the child table if the referenced row in the parent table is updated.
Example:
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
department_id INT,
FOREIGN KEY (department_id) REFERENCES departments(department_id)
ON DELETE CASCADE
);
Definition: The CHECK
constraint is used to limit the values that can be inserted into a column. It ensures that all values in a column satisfy a specific condition or rule.
Syntax:
CREATE TABLE table_name (
column_name data_type CHECK (condition)
);
Example: Create an employees
table where the salary must be greater than 0.
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
salary DECIMAL(10, 2) CHECK (salary > 0)
);
Use Case: The CHECK
constraint ensures that the salary
column never contains a value less than or equal to 0.
Definition: The DEFAULT
constraint provides a default value for a column when no value is specified during an INSERT
operation. This is useful for setting a default value in cases where a value is optional.
Syntax:
CREATE TABLE table_name (
column_name data_type DEFAULT default_value
);
Example: Create an employees
table where the hire_date
defaults to the current date if not specified.
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
first_name VARCHAR(50),
last_name VARCHAR(50),
hire_date DATE DEFAULT CURRENT_DATE
);
Use Case: If no hire_date
is provided when inserting a new employee, the database will automatically set the hire_date
to the current date.
Definition: An index is a database object that improves the speed of data retrieval operations on a table. While not a constraint itself, indexes are often used in conjunction with constraints, especially on columns that are frequently queried or used in join conditions.
Syntax:
CREATE INDEX index_name
ON table_name (column_name);
Example: Create an index on the last_name
column of the employees
table.
CREATE INDEX idx_last_name ON employees(last_name);
Use Case: Indexes speed up SELECT
queries that filter or sort by the indexed column. However, indexes can slow down INSERT
, UPDATE
, and DELETE
operations because the index needs to be updated whenever the data changes.
Data integrity refers to the accuracy, consistency, and reliability of data stored in a database. Constraints help ensure data integrity by enforcing rules and relationships within the data.
Entity Integrity: Ensured by the PRIMARY KEY
and NOT NULL
constraints.
Referential Integrity: Ensured by the FOREIGN KEY
constraint, which maintains relationships between tables.
Domain Integrity: Ensured by the CHECK
, DEFAULT
, and UNIQUE
constraints, which ensure that data is valid within its domain or range.
CASCADING DELETE and CASCADING UPDATE can be set on FOREIGN KEY
relationships to automatically propagate changes when the referenced data changes. These actions ensure referential integrity is maintained even when records are updated or deleted.
Example: When a department is deleted, all employees in that department should also be deleted.
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
department_id INT,
FOREIGN KEY (department_id) REFERENCES departments(department_id)
ON DELETE CASCADE
);
Chapter 8 covers database constraints and data integrity, crucial components for ensuring the quality and consistency of data within a relational database. You learned about various constraints like NOT NULL
, UNIQUE
, PRIMARY KEY
, FOREIGN KEY
, CHECK
, and DEFAULT
, which enforce rules on data. You also learned about indexing for performance optimization and the importance of enforcing referential and domain integrity. By properly using constraints, you can ensure that your database maintains high data quality and reliability.
In this chapter, we will dive into some of the advanced SQL functions that enhance your ability to manipulate and analyze data. While basic SQL functions like COUNT()
, AVG()
, and SUM()
are commonly used for aggregating data, advanced functions enable more complex operations, transformations, and data manipulation. These functions provide additional power for data analysis, making them essential tools for anyone working with SQL.
String functions allow you to manipulate text and perform operations like concatenation, case conversion, searching, and trimming. These functions are crucial for working with textual data.
CONCAT(): Joins two or more strings together.
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM employees;
CONCAT
function combines the first_name
and last_name
into a single column full_name
.SUBSTRING(): Extracts a portion of a string.
SELECT SUBSTRING(first_name, 1, 3) AS short_name FROM employees;
first_name
column.UPPER() / LOWER(): Converts a string to uppercase or lowercase.
SELECT UPPER(first_name), LOWER(last_name) FROM employees;
first_name
to uppercase and last_name
to lowercase.LENGTH(): Returns the length (number of characters) of a string.
SELECT first_name, LENGTH(first_name) FROM employees;
first_name
along with its length.TRIM(): Removes leading and trailing spaces from a string.
SELECT TRIM(first_name) FROM employees;
first_name
column.REPLACE(): Replaces occurrences of a substring within a string.
SELECT REPLACE(first_name, 'John', 'Jack') FROM employees;
first_name
column.Date and time functions are essential for working with temporal data. These functions allow you to extract parts of a date, calculate intervals, and format dates for reporting.
CURRENT_DATE / CURRENT_TIME / CURRENT_TIMESTAMP: Returns the current date, time, or both.
SELECT CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP;
NOW(): Returns the current date and time.
SELECT NOW();
DATEADD() / DATEDIFF(): These functions are used to perform date arithmetic.
DATEADD(): Adds a specific time interval to a date.
SELECT DATEADD(day, 7, '2025-05-01'); -- Adds 7 days to the given date
DATEDIFF(): Calculates the difference between two dates.
SELECT DATEDIFF('2025-05-11', '2025-01-01'); -- Returns the number of days between the two dates
EXTRACT(): Extracts a specific part (e.g., year, month, day) from a date.
SELECT EXTRACT(YEAR FROM hire_date) AS hire_year FROM employees;
DATE_FORMAT() (for MySQL) or TO_CHAR() (for PostgreSQL/Oracle): Formats a date into a specific string format.
SELECT DATE_FORMAT(hire_date, '%Y-%m-%d') AS formatted_date FROM employees;
NOW() and INTERVAL: Adding or subtracting time using INTERVAL
.
SELECT NOW() - INTERVAL 1 MONTH; -- Subtracts 1 month from the current date
DAYNAME(), MONTHNAME(): Extracts the name of the day or month.
SELECT DAYNAME(hire_date), MONTHNAME(hire_date) FROM employees;
Mathematical functions allow you to perform calculations and work with numeric data.
ROUND(): Rounds a number to a specified number of decimal places.
SELECT ROUND(salary, 2) FROM employees;
salary
column to two decimal places.CEIL() / FLOOR(): Rounds a number up (to the nearest integer) or down (to the nearest integer).
SELECT CEIL(salary), FLOOR(salary) FROM employees;
ABS(): Returns the absolute value of a number.
SELECT ABS(-100) FROM employees;
RAND() / RANDOM(): Generates a random number between 0 and 1.
SELECT RAND() FROM employees;
POW() / POWER(): Raises a number to a specified power.
SELECT POWER(2, 3); -- 2 raised to the power of 3
SQRT(): Returns the square root of a number.
SELECT SQRT(16); -- Returns 4
Conditional functions enable you to return different results based on certain conditions, similar to if-else
statements in programming.
CASE / CASE WHEN: Provides conditional logic directly in a query, returning different results based on conditions.
SELECT first_name, salary,
CASE
WHEN salary > 60000 THEN 'High'
WHEN salary BETWEEN 40000 AND 60000 THEN 'Medium'
ELSE 'Low'
END AS salary_category
FROM employees;
COALESCE(): Returns the first non-NULL value from a list of arguments.
SELECT COALESCE(NULL, NULL, 'Hello', 'World') AS result; -- Returns 'Hello'
NULLIF(): Returns NULL
if two expressions are equal, otherwise returns the first expression.
SELECT NULLIF(salary, 0) FROM employees;
Window functions allow you to perform calculations over a specific range of rows related to the current row, without collapsing the result into a single summary row. These are especially useful for running totals, ranking, and moving averages.
ROW_NUMBER(): Assigns a unique number to each row within a partition, starting at 1.
SELECT first_name, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rank
FROM employees;
RANK() / DENSE_RANK(): Assigns a rank to each row, with RANK()
giving gaps between equal values, and DENSE_RANK()
giving consecutive ranks.
SELECT first_name, salary, RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;
NTILE(): Divides the result set into a specified number of roughly equal parts (bins).
SELECT first_name, salary, NTILE(4) OVER (ORDER BY salary DESC) AS quartile
FROM employees;
LEAD() / LAG(): Accesses the value of a row in the result set that comes before (LAG) or after (LEAD) the current row.
SELECT first_name, salary, LEAD(salary, 1) OVER (ORDER BY salary) AS next_salary
FROM employees;
SUM() OVER(): Computes a cumulative sum across a range of rows.
SELECT first_name, salary, SUM(salary) OVER (ORDER BY salary) AS running_total
FROM employees;
In addition to standard aggregation functions like COUNT()
, SUM()
, and AVG()
, there are other advanced ways to aggregate and group data.
GROUP_CONCAT() (MySQL) / STRING_AGG() (PostgreSQL, SQL Server): Concatenates values from multiple rows into a single string.
SELECT department_id, GROUP_CONCAT(first_name ORDER BY first_name) AS employee_names
FROM employees
GROUP BY department_id;
ARRAY_AGG() (PostgreSQL): Aggregates values into an array.
SELECT department_id, ARRAY_AGG(first_name) AS employee_names
FROM employees
GROUP BY department_id;
JSON_AGG() (PostgreSQL): Aggregates rows into a JSON array.
SELECT department_id, JSON_AGG(first_name) AS employee_names
FROM employees
GROUP BY department_id;
Example: Creating a simple function to calculate a bonus based on salary.
CREATE FUNCTION calculate_bonus(salary DECIMAL)
RETURNS DECIMAL AS $$
BEGIN
RETURN salary * 0.1;
END;
$$ LANGUAGE plpgsql;
Using the UDF:
SELECT first_name, calculate_bonus(salary) AS bonus
FROM employees;
Chapter 9 covers advanced SQL functions that give you enhanced control over data manipulation and analysis. You’ve learned how to work with string functions, date and time functions, mathematical functions, conditional functions, and window functions. Additionally, we explored advanced aggregation techniques like GROUP_CONCAT()
and STRING_AGG()
, as well as creating your own user-defined functions (UDFs) for custom logic. These advanced functions are vital for performing complex data analysis, reporting, and transformation tasks. Mastery of these functions allows you to leverage the full power of SQL for sophisticated database operations.
In this chapter, we will explore database design and normalization, two crucial concepts that help you create efficient, scalable, and maintainable databases. Proper database design ensures that your data is organized in a way that reduces redundancy, improves data integrity, and optimizes performance. Normalization is a key part of this process, as it helps break down complex data structures into simpler, more manageable tables while maintaining relationships between them.
Database Design is the process of creating the structure of a database, which includes defining tables, columns, data types, relationships, and constraints. Good database design is essential for ensuring that data is stored efficiently and is easy to query, update, and maintain.
Goals of Database Design:
Data Integrity: Ensuring that data is accurate, consistent, and reliable.
Efficiency: Organizing data in a way that minimizes storage space and maximizes retrieval speed.
Scalability: Ensuring that the database can grow and handle larger amounts of data.
Maintainability: Designing the database in a way that is easy to update, modify, and scale over time.
Entities: Entities are objects or concepts that have data stored about them. For example, in a database for an e-commerce system, Customers
, Orders
, and Products
are entities.
Attributes: Attributes are the data fields that describe an entity. For example, a Customer
entity might have attributes such as CustomerID
, FirstName
, LastName
, and Email
.
Relationships: Relationships describe how entities are connected. In a relational database, relationships are usually represented using foreign keys. For example:
A one-to-many relationship between Customers
and Orders
: A customer can place many orders, but each order is placed by only one customer.
A many-to-many relationship between Orders
and Products
: An order can contain multiple products, and each product can appear in multiple orders. This is typically represented by creating a junction table to resolve the many-to-many relationship.
Example of a Relationship:
Customers
(CustomerID, FirstName, LastName)
Orders
(OrderID, OrderDate, CustomerID) — CustomerID is a foreign key referencing Customers
Normalization is the process of organizing data in a way that eliminates redundancy and dependencies. By following the rules of normalization, you can design databases that are easier to maintain, avoid anomalies, and improve performance.
Normal Forms (NF): Normalization involves splitting a database into multiple tables to achieve the following stages, called normal forms. There are several normal forms, but the first three (1NF, 2NF, and 3NF) are the most commonly used.
1NF Definition: A table is in First Normal Form (1NF) if it contains only atomic values (i.e., indivisible values), and each record (row) is unique.
Issues Addressed by 1NF:
Repeating Groups: In a table, a column should not contain multiple values. For example, a PhoneNumbers
column should not store multiple phone numbers in a single cell.
Atomic Values: Each column should hold a single, indivisible value.
Example:
Before 1NF:
EmployeeID | Name | PhoneNumbers
------------------------------------------
1 | John Smith | 123-4567, 234-5678
2 | Jane Doe | 345-6789
After 1NF (splitting the phone numbers into separate rows):
EmployeeID | Name | PhoneNumber
----------------------------------------
1 | John Smith | 123-4567
1 | John Smith | 234-5678
2 | Jane Doe | 345-6789
2NF Definition: A table is in Second Normal Form (2NF) if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key. In other words, if a table has a composite primary key (a primary key consisting of multiple columns), every non-key attribute must depend on all of the columns in the primary key, not just part of it.
Issues Addressed by 2NF:
Example:
Before 2NF (composite primary key of StudentID
and CourseID
):
StudentID | CourseID | Instructor | InstructorPhone
-----------------------------------------------------
1 | 101 | Dr. Smith | 123-4567
1 | 102 | Dr. Johnson | 234-5678
2 | 101 | Dr. Smith | 123-4567
After 2NF (splitting the instructor information into a separate table):
Enrollments
table:
StudentID | CourseID
---------------------
1 | 101
1 | 102
2 | 101
Instructors
table:
CourseID | Instructor | InstructorPhone
------------------------------------------
101 | Dr. Smith | 123-4567
102 | Dr. Johnson | 234-5678
3NF Definition: A table is in Third Normal Form (3NF) if it is in 2NF and all of its non-key attributes are not transitively dependent on the primary key. In other words, there should be no indirect dependencies between non-key attributes.
Issues Addressed by 3NF:
Example:
Before 3NF:
EmployeeID | DepartmentID | DepartmentName | DepartmentLocation
---------------------------------------------------------------
1 | 101 | HR | Building A
2 | 102 | IT | Building B
After 3NF (moving DepartmentName
and DepartmentLocation
into a separate table):
Employees
table:
EmployeeID | DepartmentID
-------------------------
1 | 101
2 | 102
Departments
table:
DepartmentID | DepartmentName | DepartmentLocation
---------------------------------------------------
101 | HR | Building A
102 | IT | Building B
BCNF Definition: A table is in Boyce-Codd Normal Form (BCNF) if it is in 3NF and if every determinant is a candidate key. A determinant is any attribute that can uniquely determine other attributes in the table.
BCNF is a stricter version of 3NF and deals with certain types of anomalies that can still occur in 3NF.
Example:
4NF Definition: A table is in Fourth Normal Form (4NF) if it is in BCNF and there are no multi-valued dependencies. A multi-valued dependency occurs when one attribute in a table determines multiple independent attributes.
Issues Addressed by 4NF:
Definition: Denormalization is the process of combining tables that were split during normalization to improve read performance. This is typically done when performance needs outweigh the benefits of strict normalization, especially in data warehousing or reporting applications.
Example: A query that joins multiple tables and performs complex aggregation might be slow, so denormalizing the data to store aggregated values can improve performance.
Chapter 10 explores the principles of database design and normalization. We’ve discussed the importance of organizing data into well-defined tables, eliminating redundancy, and maintaining relationships between entities. By applying normalization techniques (1NF, 2NF, 3NF, BCNF, 4NF, 5NF), we ensure that the database is free from anomalies and maintains data integrity. Additionally, we explored denormalization, which may be used in certain cases to improve performance by reducing the need for complex joins and queries. Proper database design and normalization lead to efficient, reliable, and maintainable databases that scale as your application grows.
In this final chapter, we’ll cover some practical tips and tricks that will help you become more efficient with SQL. These tips will improve your query writing, performance optimization, and database management skills. They will also give you insights into best practices, common pitfalls, and ways to troubleshoot SQL queries and manage large datasets.
Tip: Aliases can make your queries easier to read and maintain, especially when working with complex joins or derived columns.
Example: Instead of writing long table names repeatedly, use aliases.
SELECT e.first_name, e.last_name, d.department_name
FROM employees AS e
INNER JOIN departments AS d ON e.department_id = d.department_id;
EXPLAIN
for Query PerformanceTip: Use the EXPLAIN
or EXPLAIN ANALYZE
keyword before your query to view the query execution plan. This can help you identify performance bottlenecks such as full table scans or inefficient joins.
Example:
EXPLAIN SELECT first_name, last_name FROM employees WHERE salary > 50000;
Why it helps: This will show you how SQL is executing your query, which indexes are used, and how tables are joined. It’s invaluable for performance tuning.
SELECT *
in Production QueriesTip: Avoid using SELECT *
unless necessary. Always specify the columns you need.
Why it helps: Selecting unnecessary columns increases I/O and can slow down your query, especially if the table is large. It also ensures better control over the data retrieved.
Example:
SELECT first_name, last_name, salary FROM employees;
JOIN
Instead of Subqueries When PossibleTip: In most cases, a JOIN
is faster than a subquery because subqueries may result in multiple executions, which can impact performance.
Why it helps: Joins allow the database engine to perform optimizations, reducing query time.
Example:
-- Using JOIN
SELECT e.first_name, e.last_name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
-- Using Subquery
SELECT first_name, last_name
FROM employees
WHERE department_id = (SELECT department_id FROM departments WHERE department_name = 'Sales');
Tip: Index columns that are frequently used in WHERE
, JOIN
, or ORDER BY
clauses. Indexing speeds up data retrieval, but be mindful of adding too many indexes, as they can slow down INSERT
, UPDATE
, and DELETE
operations.
Example:
CREATE INDEX idx_employee_department ON employees(department_id);
CASE
for Conditional LogicTip: Use the CASE
statement to implement conditional logic directly in your queries, rather than performing these operations in your application logic.
Example:
SELECT first_name, last_name,
CASE
WHEN salary > 100000 THEN 'High'
WHEN salary BETWEEN 50000 AND 100000 THEN 'Medium'
ELSE 'Low'
END AS salary_category
FROM employees;
GROUP BY
QueriesTip: When using GROUP BY
, always ensure that the grouping column is indexed if possible. This will help in reducing the time required for grouping and aggregation.
Example:
SELECT department_id, COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;
Tip: Be careful when dealing with NULL
values, as they can lead to unexpected results. Always use IS NULL
or IS NOT NULL
when filtering for NULL
values.
Example:
SELECT first_name, last_name
FROM employees
WHERE middle_name IS NULL;
Tip: For large datasets, it’s better to batch INSERT
, UPDATE
, and DELETE
operations to minimize the load on the database.
Example:
Instead of running multiple INSERT
statements:
INSERT INTO employees (first_name, last_name, salary) VALUES ('John', 'Doe', 50000);
INSERT INTO employees (first_name, last_name, salary) VALUES ('Jane', 'Doe', 60000);
INSERT INTO employees (first_name, last_name, salary)
VALUES ('John', 'Doe', 50000), ('Jane', 'Doe', 60000);
LIMIT
and OFFSET
for PaginationTip: When dealing with large result sets, use LIMIT
and OFFSET
for pagination to improve performance and reduce memory usage.
Example:
SELECT * FROM employees LIMIT 10 OFFSET 0; -- First page
SELECT * FROM employees LIMIT 10 OFFSET 10; -- Second page
Tip: Avoid joining large tables without proper indexing or filtering. If possible, use smaller, pre-filtered datasets before performing joins.
Why it helps: Joining large tables without filtering or indexing can result in very slow queries.
Best Practice: Apply filters (WHERE
conditions) early to limit the size of the datasets being joined.
Tip: Avoid using functions (like LOWER()
, UPPER()
, or DATE()
) in the WHERE
clause as they can disable the use of indexes, leading to slower queries.
Why it helps: Functions in the WHERE
clause cause the database to scan all rows rather than using indexes effectively.
Example:
-- Bad practice (using functions in WHERE clause)
SELECT * FROM employees WHERE LOWER(first_name) = 'john';
-- Better practice
SELECT * FROM employees WHERE first_name = 'John';
UNION
vs UNION ALL
WiselyTip: Use UNION ALL
when you don’t need to remove duplicates, as it’s faster than UNION
because UNION
performs an additional step of removing duplicates.
Example:
UNION
(removes duplicates):
SELECT first_name FROM employees WHERE department_id = 1
UNION
SELECT first_name FROM employees WHERE department_id = 2;
UNION ALL
(does not remove duplicates):
SELECT first_name FROM employees WHERE department_id = 1
UNION ALL
SELECT first_name FROM employees WHERE department_id = 2;
Tip: For read-heavy applications, such as reporting, consider denormalizing your data to reduce the need for complex joins. This can improve performance, though it comes at the cost of increased storage and potential for data inconsistency.
Example: Store pre-aggregated data for reporting purposes to avoid recalculating the same values multiple times.
Tip: Always use transactions (BEGIN
, COMMIT
, ROLLBACK
) when performing multiple data-modifying operations. This ensures that either all changes are applied or none, preserving data consistency.
Example:
BEGIN TRANSACTION;
UPDATE employees SET salary = salary + 5000 WHERE department_id = 3;
INSERT INTO audit_log (action, timestamp) VALUES ('Salary Update', NOW());
COMMIT; -- If everything works as expected
-- or
ROLLBACK; -- If something goes wrong
JOIN
Queries with IndexesTip: When joining large tables, ensure the columns involved in the JOIN
condition are indexed. This can drastically improve query performance.
Example:
CREATE INDEX idx_employee_department ON employees(department_id);
CREATE INDEX idx_department_id ON departments(department_id);
Chapter 11 offers a range of SQL tips and tricks to help you write efficient queries, optimize performance, and manage large datasets effectively. These best practices will help you avoid common pitfalls, ensure data integrity, and work more efficiently with SQL. By following these guidelines, you can improve the performance of your queries, make your database more maintainable, and develop better habits as you work with SQL on a daily basis.