Fundamentals of the Relational Model
Background to T-SQL Querying and Programming
Theoretical background
- SQL
- Set Theory
- Predicate Logic
- The relational model
-
- Propositions, predicates, and relations
- Missing values
- Constraints
- Normalization
-
-
- 1NF
- 2NF
- 3NF
-
- Types of Database systems
- Online transactional processing
- Data warehouses
SQL Server Architecture
- The ABCs of Microsoft RDBMS flavors
-
- Box
- Appliance
- Cloud
- SQL Server instances
- Databases
- Database layout
- Schemas and objects
Creating tables and defining data integrity
- Creating tables
- Defining data integrity
-
- Primary key constraints
- Unique constraints
- Foreign-key constraints
- Check constraints
- Default constraints
Theoretical background
A relational database management system (RDBMS) is based on the relational model, which in turn is based on two branches of mathematics: Set Theory and Predicate Logic.
SQL
Data in a RDBMS can be managed and queried using the SQL language, and Microsoft’s dialect and extension of SQL is T-SQL. SQL is a declarative language that is both ANSI and ISO standard. A declarative language specifies what we want and not how to get it (as with procedural languages). The RDBMS determines the mechanics required to process the request. The three main categories of SQL statements are:
- DDL (Data Definition Language): Allows the user to define or modify data structures or objects, and includes statements such as CREATE, ALTER, and DROP.
- DML (Data Manipulation Language): Allows the user to query and manipulate the data in the tables of a database, and include statements such as SELECT, INSERT, UPDATE, DELETE, TRUNCATE, and MERGE.
- DCL (Data Control Language): Contains only two statements: GRANT and REVOKE. They are used to manage the rights users have in a database
DDL stands for Data Definition Language and includes commands used to define the data structure.
DML stands for Data Manipulation Language and those commands are all about how to edit the data.
TCL stands for Transaction Control Language which includes commands about working with database transactions.
Set Theory
A set is on a single entity representing a collection of objects. In set theory, every element in a set must be unique, and this can be defined, and the ordering of the elements in the set is irrelevant. This means a query against a table will return a result set with rows presented in any order unless the ordering of the rows is explicitly specified using the ORDER BY clause.
Predicate Logic
A predicate is an expression that will be either true or false when that expression is attributed to a set.
In the relational model, predicates are applied as some sort of logic to attributes to define the structure and maintain the logical integrity of the data. Predicates are used to filter data to define sets or subsets. We can accomplish this by first defining the propositions to be evaluated. For example, a predicate can be used to define a in the Production.Product table to only allow only those records to be stored where the days to manufacture is greater than 0:
USE AdventureWorks2017; SELECT productid, name, daystomanufacture FROM production.product WHERE daystomanufacture> 0;
Here, the proposition is “greater than 0” and the predicate is “days to manufacture is greater than 0.” On those records for which the proposition evaluates to True will be returned as the result set.
The Relational Model
As previously mentioned, the RDBMS relies on the relational model to enforce data integrity while also allowing the ability to store, manage, and query data. Data integrity is important because it ensures consistency of the data while keeping data redundancy minimal or absent, without sacrificing completeness.
The relational model does not refer to relationships between tables. Rather, each individual table in SQL is a relation. A relation is made up of a heading and body. A heading is a set of attributes, also known as columns of a table. An attribute comprises of a name and type. The body is a set of tuples, also known as rows of a table, and these can be identified by keys.
Propositions, predicates, and relations
A data proposition is an expression that must be either True or False; only those rows for which the proposition evaluates as True will return as the result set table. Rows that are evaluate as False will never manifest, a presumption known as closed-world assumption (CWA). Predicates are parameterized propositions.
As previously mentioned, a table comprises of a heading (a set of attributes) and body (a set of tuples). In the relational model, all attributes are unordered and distinct, and they are defined by a name and data type. A type is a simple form of predicate, where the values for an attribute are constrained to a certain set of possible or valid values. For example, the INT data type constrains all values in the attribute to be an integer between -2147483648 and 2147483647.
Missing Values
Two-valued predicate logic limits values to be True or False. However, missing values are neither True or False; they are evaluated as Unknown. SQL supports the implementation of three-valued predicate logic, where a predicate can be evaluated as either True, False, or Unknown. Edward Codd took this further and evangelized four-valued predicate logic, which considers two different cases for missing values: missing but applicable (A-Values marker) and missing but inapplicable (I-Values marker). SQL supports only the generic instance of missing values and does not recognize the two distinctions for cases for the missing values.
Normalization
Normalization rules can be enforced to 1.) avoid any anomalies during the data modification process, and 2.) minimize data redundancy without sacrificing completeness. Codd introduced the first three normal forms – 1NF, 2NF, and 3NF:
- 1NF: A table is already in the first normal form if it is truly a relation. In the first normal form, all rows (tuples) in the table (relation) must be unique and at each column (attributes) must contain atomic values. The uniqueness of rows can be achieved by defining a unique key for a table. For an attribute to be atomic, it must be indivisible and irreducible. Atomicity is subjective. For example, is an address attribute is considered atomic for an application, not including the city as part of the address would be a violation of the first normal form.
- 2NF: For a table to comply with the second normal form, the table must already be in the first normal. The second rule for achieving 2NF is that each non-key attribute must be fully functional on the table’s primary key. In other words, each field that is not the primary key is determined by that primary key, so it is specific to that record. We can determine the value of any attribute of any record as long as we know the attribute values of the primary key.
- 3NF: For a table to achieve the third normal form, it must first meet the second normal form. The second rule is that all of the columns of the table are not transitively dependent on the primary key. Every attribute that is not the primary key must depend on the primary key and the primary key only.
SQL and T-SQL enables all the normal forms to be violated when developing tables. There are several additional normal forms like Boyce-Codd, but for most database normalization exercises, stopping at 3NF is enough.
https://tapoueh.org/blog/2018/03/database-normalization-and-primary-keys/
Creating Tables and defining data integrity
Tables reside within schemas, and schemas reside within databases. By default, a schema named dbo is automatically created in every database unless specified otherwise. dbo is also the default schema for users not explicitly associated with a different schema.
Creating Tables
USE AdventureWorks2017; DROP TABLE IF EXISTS dbo.Employees; CREATE TABLE dbo.employees (empid INT NOT NULL, firstname VARCHAR(30) NOT NULL, lastname VARCHAR(30) NOT NULL, hiredate DATE NOT NULL, mgrid INT NOT NULL, ssn VARCHAR(20) NOT NULL, salary MONEY NOT NULL );
When creating tables, it is recommended to use the USE statement to ensure that SQL Server is creating objects in the correct database context.
The DROP IF EXISTS command is used to drop a table if it exists. The following is an alternative to the above query:
IF OBJECTID(N’dbo.Test’,N’U’) IS NOT NULL DROP TABLE dbo.Test;
The query above looks to see if the dbo.Test table exists in the current database, and drops it if that table exists. SQL Server returns a NULL if the table does not exist. The OBJECT_ID functions accepts the object name and type as inputs, and the type U means the object is a user table.
The CREATE TABLE statement is used to specify the name of the table we are creating, and within the parentheses we define the columns with their names, data type, and nullability.
If we omit the schema name when creating the table, SQL Server will assume the default schema that is associated with the database user creating the table. For stored procedure, SQL Server will assume the schema associated with the procedure’s owner.
If we don’t explicitly specify an attribute’s nullability, SQL Server by default will assume the attribute accepts NULLs, but there are settings we can modify to change this behavior. It is good practice to not rely on defaults, and define an attribute as NOT NULL unless there’s a compelling reason otherwise.
It is also good practice to terminate each statement with a semi-colon. While this is not a requirement in several database systems, not ending queries with a semi-colon is a deprecated feature in T-SQL.
Defining Data Integrity
Procedural data integrity is when we enforce data integrity with code such as with triggers or stored procedures. Conversely, declarative data integrity is when we enforce data integrity as part of defining the table we are creating. Constraints are rules that are defined in the data model and enforced by the RDBMS. Examples of enforcing declarative data integrity constraints include setting data types and nullability (whether or not it supports NULLs) for attributes, as well as defining primary, unique, foreign, check, and default constraints. We can define these constraints as part of the CREATE TABLE statement, but if the tables are already created we can define the constraints using the ALTER TABLE statement. We define composite key when the constraints are based on more than one attribute. Composite key constraints can be defined with all constraint types except for the default constraint.
Primary-Key constraints
Candidate keys provide entity integrity, and can be defined on one or more attributes. When using multiple candidate keys, one of the candidate keys is assigned as the primary key while all other candidate keys are known as the alternate keys. Primary-key constraints enforce the uniqueness of rows, and prevent the occurrence of duplicate rows in a table. Primary keys also disallow NULLs in the constraint attributes. Each unique set of values can only appear in one row per table. Only one primary key is allowed per table.
ALTER TABLE dbo.employees ADD CONSTRAINT PK_Employees PRIMARY KEY(empid);
To enforce the uniqueness of the primary-key constraint, SQL Server creates a physical mechanism called a unique index behind the scenes. Unique indexes and other types of indexes are also used to speed up queries by avoiding full table scans.
Unique constraints
Uniqueness of rows are also enforced through the use of alternate key constraints via unique constraints. Unlike primary keys, we can define multiple unique key constraints for the same table. Also unlike primary keys, unique key constraints allow NULLs. Standard SQL allow multiple NULLs (as if two NULLs are different from each other) for columns with unique keys. SQL Server assume all NULLs are equal, and duplicate NULLs are rejected. To emulate the standard SQL unique constraint behavior, we use a unique filtered index on SQL Server that filters only non-NULL values. For example, suppose we have a column containing SSN information and allowed NULLs, and we create such an index instead of a unique constraint, here’s how the query would go:
ALTER TABLE dbo.employees ADD CONSTRAINT UNQ_Employees_ssn UNIQUE(ssn);
We define the unique index, and filter out all SSN that are NULLs from the index so duplicate NULLs will be allowed whereas duplicate non-NULLs are not.
Foreign-key constraints
A foreign key enforces referential integrity. The foreign key is defined on one or more attributes in the referencing table, and it references a primary-key or unique constraint attributes in the referenced table (which can a different or possibly same table). The foreign key restricts the values allowed in that column to only values that exist in the referenced column.
-- Code for creating the demo table dbo.Orders CREATE TABLE dbo.Orders (orderid INT NOT NULL, empid INT NOT NULL, custid VARCHAR(10) NOT NULL, orderts DATETIME2 NOT NULL, qty INT NOT NULL, CONSTRAINT PK_Orders PRIMARY KEY(orderid));
Suppose we wish to restrict the values supported by the empid column in the Orders table to values that appear in the empid column in the Employees table. We can accomplish this by defining a foreign-key constraint on the empid column in the Orders table referencing the empid column of the Employees table This is how the query would go:
-- Use AdventureWorks2017 database USE AdventureWorks2017; -- Create foreign key for dbo.Orders ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Employees FOREIGN KEY(empid) REFERENCES dbo.Employees(empid);
Similarly, we can write the following query to restrict the values supported by the mgrid column in the Employees table to the values that exist in the empid column of the same table:
ALTER TABLE dbo.Employees FOREIGN KEY(mgrid) REFERENCES dbo.Employees(empid);
The preceding two examples are examples of foreign keys that enforce a referential action called no action. This means any attempt to delete rows from the referenced table or update the referenced primary or unique key columns will be rejected if related rows exists in the referencing table. Such attempts can be compensated for by defining the foreign key with actions, to delete rows from the referenced table or update the referenced primary or unique key column when related rows exist in the referencing table. We can define the options ON DELETE and ON UPDATE with actions such as CASCADE, SET DEFAULT, and SET NULL as part of foreign-key definition.
- CASCADE: The delete or update operation will be cascaded to related rows. For example, ON DELETE CASCADE means when we delete a row from the referenced table, related rows from the referenced table will also be deleted.
- SET DEFAULT and SET NULL: The compensating action will set the foreign-key attributes of the related rows to default value of the column or NULL, respectively. Regardless of what action is choses, the referencing table will only have orphan rows in the case of the exception with NULLs. Parents with no children rows are always allowed.
Check constraints
Using a check constraint, we can define a predicate a row must meet in order to be entered into a table or be modified.
ALTER TABLE dbo.Employees ADD CONSTRAINT CHK_Employees_salary CHECK(salary>0.00);
Any attempt to insert or update a row when the predicate evaluates to FALSE will be rejected by the check constraint. Modifications are only accepted when the predicate evaluates to either TRUE or UNKNOWN (if the column is nullable). SQL is based on the three-valued logic, resulting in two actual actions. A rows is either accepted or rejected with a check constraint.
When adding check and foreign-key constraints, we can specify an option called WITH NOCHECK. WITH NOCHECK allows the RDBMS to bypass constraint checking for existing data. This is considered bad practice because we cannot ensure data consistency. We can also disable or enable existing check and foreign-key constraints.
Default constraints
A default constraint can be used to input a default value when one is not specified for the attribute when inserting a row. For example, we can write the following code to define the default constraint for the OrdersTest attribute:
ALTER TABLE dbo.Orders ADD CONSTRAINT DFT_Orders orderts DEFAULT(SYSDATETIME()) FOR orderts;
In the above code, the default constraint invokes the SYSDATETIME function, which returns the current data and time value. This means whenever a row is inserted in the Orders table and no value is explicitly specified in the OrdersTest attribute, SQL Server will set the attribute value to SYSDATETIME.
Lastly, the following code is executed for cleanup:
-- Drop demo tables dbo.Orders and dbo.Employees DROP TABLE IF EXISTS dbo.Orders, dbo.Employees;