The goal of this tutorial is to give a quick introduction how to use the relational algebra calculator and its concepts. It assumes that you already know the relational algebra or are learning it from other sources.
There is no real standard for the relational algebra like there is for SQL. So every book or teacher
might have its slightly different interpretation and notation.
The goal of this progam was to support the most commonly used "mathematical" notation used by Database Systems The Complete Book 2nd
Edition by Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, Datenbanksysteme: Eine Einführung
by Alfons Kemper and André Eickler and Wikipedia - Relational Algebra and
others.
The core element of the calculator is the relation (or table) which consists of a fixed number of attributes (or columns) in a fixed order (this is called the schema of the relation) and a set of tuples or rows containing the specific values.
Each attribute has three distinct properties: its type, its position and its name.
The type or domain of an attribute is either string, number, date or boolean.
The type is used for example to determine if two values can be compared in a boolean expression or if
two schemas are union compatible. In most cases the type of the attributes are obvious if you look at
the values.
The position of each attribute in a schema is fixed and can be used to adress the attributes.
An example would be the projection of the first and third attribute or column of an arbitrary relation
R: π [1], [3] ( R )
The full qualified name of the attribute is a unique identifier of the attribute within the schema of its
relation.
It consists of the name itself and a relation qualifier and are written like in SQL as R.a
where a is the name and R is the relation qualifier.
An example would be the projectoin of the attributes a, b from a relation R: π R.a, R.b ( R )
The default relation qualifier of each attribute is the name of its relation.
If the attributes name without the qualifier is unique within the relation's schema, it also can be used
to address a specific attribute.
The previous example could also be written as π a, b ( R )
.
Each relation has a set of tuples (or rows). This means that there are no duplicate tuples within one
relation and the duplicate-elimination is implicitly executed after every single step of the
calculation.
The tuples in the calculator have a defined order and unlike a normal database system all operations are
implemented to preserve that order. This should help the users to see what has changed from one step to
the next.
so far we covered that
π [1], [2] ( R )
,π a, b ( R )
π R.a,
S.a ( R x S )
Before we introduce how to use the operators this should be a quick introduction of a very handy feature of the relational algebra calculator: the alternative plain text notation
The "classic" mathematical notation uses greek letters like π, σ for the unary operations and
special symbols like the join symbol ⋈ or the union symbol ∪
for some binary operations.
This symbols can be entered using the toolbar at the top of the editor.
This calculator also supports a alternative syntax for all this symbols that follows two very simple rules: Every greek letter can be substituted with its name spelled out ("pi" for π, "gamma" for γ) and every other symbol has an equivalent name that is borrowed from SQL, programming languages like C and Set theory.
This substitutions should be easy to read and much more important very easy to write because you don't
need any toolbar or mouse. The calculator also supports autocomplete: just press [CTRL]+[SPACE] to
complete the current keyword.
This feature should help you to write your statements more quickly and fluently.
π R.a, S.a, S.b
σ R.a = S.a ∧ ( R.a > 5 ∨ R.a < 0 ) (
R ⨯ S
)
is equivalent to:
pi R.a, S.a, S.b
sigma R.a = S.a and ( R.a > 5 or R.a < 0 ) (
R cross join S
)
In the following table you can see a list of all supported substitutions:
classical notation | alternative notation |
---|---|
π | pi |
σ | sigma |
ρ | rho |
τ | tau |
γ | gamma |
∩ | intersect |
∪ | union |
- | \ |
÷ | / |
⨯ |
|
⋈ |
|
⟕ |
|
⟖ |
|
⟗ | full outer join |
⋉ | left semi join |
⋊ | right semi join |
▷ |
|
← | <-
pi new_name <- a ( R )
|
→ | ->
pi a -> new_name ( R )
|
For this Part we use the "bank example" Dataset with 3 relations: Customers, Accounts and PremiumCustomers. By convention relations start with a uppercase letter and attributes with a lower case letter.
Open the dataset used in this tutorial using the following link to the "bank example" Dataset.
You find the relations and their attributes listed on the side and if you hover a relations name a preview of the first view tuples is displayed.
After you have found the Dataset you can formulate the very first and most basic query in relational algebra: a relation without any further manipulation.
Just enter the name of a relation into the code editor or click on the relation/attribute names to insert
them into the code editor.
Note that the editor supports auto completing the relation/attribute names
of the current dataset and the operators with [CTRL]+[SPACE]
So if you want all tuples of the relation Customer you should have the following statement: Customer
.
And get all the tuples if you press the execute button or press [CTRL]+[RETURN].
All unary operations have the same basic syntax FUNCTION ARGUMENT (
CHILD_EXPRESSION )
.
The braces around the CHILD_EXPRESSION
can be omitted. In this case the
predefined operator precedence for relational algebra applies.
A complete list of the supported relalg operations can be found here: general syntax, unary operations and binary operations.
The projection is one of the basic operations that allow to choose which of the attributes of the parent relations should be included in the new one and in which order they should be.
Renaming a relation (ρ) changes the qualifiers of all the relations attributes but does not touch the tuples.
Renaming an attribute (ρ) only changes the name of a specific attribute (and leaves his relation-qualifier unchanged).
The statement pi balance ( Accounts )
returns a new relation with only the balance
attribute.
The next statement gets the balance with the account-id after renaming the relation to A and renames one of the attributes.
rho account_number <- aid (
pi aid, A.balance (
rho A (
Accounts
)
)
)
Like in SQL or most programming languages you can format your statement and use SQL like
comments (with -- ...
or /* ... */
) to increase the readability.
The next statement uses a selection to filter the tuples of a relation based on a boolean expression. The
calculator supports complex boolean expression with functions and built in operator precedence.
The attributes in the boolean expression can be given as name or numeric position like with the
projection.
The next statement selects all customers-ids of customers who have the same value for their firstname and lastname.
-- this should return an relation with no tuples:
pi cid (
sigma firstname=lastname ( Customers )
)
The next example uses a more complex expression to get all accounts with a balance over 100 or under -100.
sigma balance > 100 or (balance*-1 > 100) ( Accounts )
-- (balance < -100) would also be correct
As a shorter alternative you can use a function in your expression to get the same result:
sigma abs(balance) > 100 ( Accounts )
Everybody can provide datasets that can be used in the relational algebra calculator and share them with
others.
We assume the scenario of a teacher wanting to provide a dataset for his/her students for this short
tutorial.
The datasets are specified in a simple text format and can be shared with others via GitHub Gists (a simple and free platform to share snippets).
The shared gist gets an unique ID and the relational algebra calculator can load the dataset directly
using this ID.
The fist step is the creation of the dataset which is actually only a group of relation definitions with
some additional information and is therefore refered as group in the specified format.
The relations can then be used by the students to formulate the there statements.
Lets assume we want to create a dataset of employees of a company.
group: bank example
every header field starts with the name of the field and is followed by a colon for single line values.
The next (optional) header field we should specify is the description. It should contain information like
who is maintaining this dataset or where to find additional information.In our example we add a description that takes more than a single line and therefore we enclose the value in double brackets instead of using the colon.
group: bank example
description[[ the data for this dataset was generated using <http://www.generatedata.com/>
* the relation _Customers_ contains basic information about the customers of the bank.
The relation _Accounts_ contains the basic information of a single account.
Note that a customer can have any number of accounts.
* the relation _PremiumCustomers_ contains
the customer-ids of all customers with a total balance over 1000
]]
The next step is to actually add the relations the students can use for their queries.
The relation definitions are use the relational algebra syntax that can be used in this tool.
Every relation is defined by a single variable assignment where the name of the variable is used as the
relations name and the result of the expression defines the relation.
The relalg expression can use all features that can be used in the tool and can also use other relations
defined within the same dataset.
Note that the name of the relation is used as the qualifier of each attribute/column.
For the relation Customers
relation we use the inline
relation syntax as the most basic method to define the relation and can also be edited using the
relation editor which is a simple spread-sheet like
editor.
The inline relations in combination with the editor should be very easy to use if you enter the data
directly or if you have them as a csv/spread-sheet file.
group: bank example
description[[ the data for this dataset was generated using <http://www.generatedata.com/>
* the relation _Customers_ contains basic information about the customers of the bank.
* the relation _Accounts_ contains the basic information of a single account. Note that a customer can
have any number of accounts.
* the relation _PremiumCustomers_ contains the customer-ids of all customers with a total balance over
1000
]]
Customers = { cid firstname lastname
1 Dexter Simpson
2 Kaseem Gallagher
3 Kuame Hamilton
4 Robert Thompson
5 Rhiannon Valentine
6 Calvin Mays
}
Accounts = {
aid, cid, balance:number
1, 1, 66
2, 1, -304
3, 2, 272
4, 3, 3472
5, 4, 975.7
6, 4, 93
7, 5, 534
8, 5, -75.5
}
As we can see the name of the relations are defined by the assignments.
The inline-relations are enclosed by curly brackets and contain the names of the attributes/columns in
the first line and then a tuple/row per line where the values are simply separated by whitespace. You
can also other seperators and can define the types explicitly as we can see at the Accounts relation.
For further information can be found at inline relation
description.
Note that, unlike the variables used in a query, the definition of a new relation implicitly sets the
attribute qualifier to the name of the relation.
So the schema of the account relation is (Accounts.aid, Accounts.cid, Accounts.balance)
.
This allows each attribute to be accessible with this name.
The last relation we need to add in this example is the relation containing the banks premium Customers.
They are specified by using the other two relations in a simple relational algebra statement that
selects all customers with a total balance over 1000.
group: bank example
description[[ the data for this dataset was generated using <http://www.generatedata.com/>
* the relation _Customers_ contains basic information about the customers of the bank.
* the relation _Accounts_ contains the basic information of a single account. Note that a customer can
have any number of accounts.
* the relation _PremiumCustomers_ contains the customer-ids of all customers with a total balance over
1000
]]
Customers = { cid firstname lastname
1 Dexter Simpson
2 Kaseem Gallagher
3 Kuame Hamilton
4 Robert Thompson
5 Rhiannon Valentine
6 Calvin Mays
}
Accounts = {
aid, cid, balance:number
1, 1, 66
2, 1, -304
3, 2, 272
4, 3, 3472
5, 4, 975.7
6, 4, 93
7, 5, 534
8, 5, -75.5
}
PremiumCustomers =
pi cid (
sigma sum > 1000 (
gamma cid; sum(balance)->sum (
Accounts
join Customers
)
)
)
We have now seen how to define a dataset with its header containing the name and the description followed by relational algebra assignments defining the relations of the dataset.
We can paste this definition directly in the Group Editor to load and use it.
In the next section we want to look at how we can publish this definition so that we give our students a
single url or id to directly load the dataset.
If we want to share the definition of a dataset with other people we could simply give them the definition and they load it using the group editor, but that would not be that practical for most cases.
The simpler solution (for the users) is to publish the definition as a GitHub Gist and share the ID of the gist with others.
Just create a gist with the definition as its content (the filename does not matter) and publish it. The ID of the Gist can now be found at the top of the page as gist:xxxxxxxxxxxx or in the url after the last slash.
This ID can then be shared and loaded in the interface or the calculator can be called directly with a
specific ID by using using the parameter ?data=gist:xxxxxxxxxxxxxx
.
For example the simple bank definition of this tutorial has been published as a Gist with the ID
2cfb981fbc5676182d64 and can therefore be loaded directly by appending ?data=gist:2cfb981fbc5676182d64
to the url of the calculator.
syntax | NAME = EXPRESSION |
---|
Defines a new local variable with the name NAME; its content is defined by EXPRESSION
The name of the new relation must be unique.
The definition has to be executed with the statement.
TestA = π a,b R
TestB = σ d > 0 S
-- statement using the variable
TestA join TestB
An assignment (= definition of a variable) is no valid relational algebra expression on its own. If you miss the acutal query a error is thrown (Error: only assignments found; query is missing).
-- this is the definition of the variable
Test = π Customer.firstname, surname ( Customer )
-- this is the actual query/statement using the variable
Test
The defined variable can be used like the assigned expression could be used because
every usage of the variable gets replaced with its definition before the query gets executed.
This also means that the variable-name has no influence on the schema of the expression
and the names of the attributes/columns are not affected by assignment:
X = R
X join S
The attributes of the relation R are only accessible with its original names (R.a, R.b, ..),
and are not affected by the assignment.
There is a known problem when the last assignment ends with a natural join and the query consists of a single relation:
A = S join R
A -- this is the query
The statement is ambiguous and the parser interprets it as A = (S join R A)
where R is interpreted as a column argument for the theta-join and therefore detects a
cyclic usage of the variable A.
Solution: To get the expected behaviour you have to set braces around the assigned expression
like A = (S join R)
A
syntax | -- COMMENT_TEXTEXPRESSION |
---|
the '--' must be followed by at least one whitespace charater! inserts a comment; its text goes until the end of the line
comments are recognized as whitespace
Test =
-- This is the expression that is assigned to Test:
π Customer.firstname, surname ( Customer )
syntax | RELATION_NAME |
---|
Uses a pre defined relation with the name RELATION_NAME
The code completion only works for this relations.
( Customers ) cross join ( Accounts )
syntax | /* COMMENT_TEXT */
EXPRESSION |
---|
inserts a comment that can span multiple lines
comments are recognized as whitespace
/* This is
a very
very
long comment */
Test = π Customer.firstname, surname ( Customer )
syntax | {COLUMN_NAME_1:COLUMN_TYPE_1 ...
ROW_1
ROW_2
...
}
|
---|
The inline-relation is a temporary relation that can be defined directly in the statement. It is only valid in the defining statement
Every inline-relation is a valid expression and thus can be used at any position a EXPRESSION is expected.
The inline-relation is defined by a header, that specifies the schema of the relation and the rows containing the values and is surrounded by curly brackets.
QUALIFIER.COLUMN_NAME:COLUMN_TYPE
separated by any whitespace,
comma or semicolon.
The QUALIFIER is optional. Also the COLUMN_TYPE can be omitted if the type is well
defined by the values of that column. The first non null value of a column defines its type.
The rows of the relation are defined by a list of values per row with the type of the
corresponding column. The values are separated by whitespace comma or semicolon.
Simple strings only containing letters, numbers, hyphens, underlines, dots or periods
([0-9a-zA-Z\-_\.]+) can be written without single quotes. NULL and null are
constant values. If null, true or false should be used as string they have te be quoted.
More complex strings must be surrounded by single quotes: 'content' or content
but '' or 'long content containing spaces and special characters like !' or 'null'.
Dates are written in ISO-format: YYYY-MM-DD without single quotes
A null-value can be written as null or NULL (without single quotes).
Numbers could be integers in the form (-?[0-9]+) or floats in the form (-?[0-9]+\.[0-9]+).
Numbers in single quotes are recognized as numbers if the column type is defined as number
or has been detected to be number from a previous value; otherwise it will be a string value..
A boolean value is denoted as either true or false (case insensitive).
The header and rows can be indented if needed.
-- type for column b is defined by the first value
rho A {
a:number, b
1, 2
3, 4
}
cross join
{
a:string X.b:date c:number
Alpha 1970-01-01 1
'Beta 2' 1970-01-02 3
'' 1970-01-03 4
}
A valid relational algebra expression is built by connecting relation-name or inline-relation as atoms with the defined unary and binary operators.
So a relational algebra expression is recursively defined as follows:FUNCTION ARGUMENT ( CHILD_EXPRESSION )
symbol | π |
---|---|
alternative syntax | pi |
example |
pi a, b ( R )
|
The argument is a subset of columns of the schema of the CHILD_EXPRESSION or a value expression
π Customer.firstname, surname ( Customer )
pi c.id, [1] ( ρ c ( Customer ) )
pi c.id, lower(username)->user, concat(firstname, concat(' ', lastname))->fullname (
ρ c ( Customer )
)
rownum()
expression can be used to get the same information. And it can also be used
directly in the boolean condition of a selection or join.
pi firstname, lastname
sigma rownum() <= 5
tau countOrders desc
Customer
symbol | σ |
---|---|
alternative syntax | pi |
example |
sigma a > 2 ( R )
|
The argument is a boolean expression that each row of CHILD_EXPRESSION is checked on
σ firstname = 'Bob' or firstname = 'Alice' ( Customer )
σ (id > 10 and id < 100) or id = 42 ( Customer )
σ mod(length(firstname),2) = 0 ( Customer )
symbol | ρ |
---|---|
alternative syntax | rho |
example |
( R ) join R.a = X.b (rho X ( R ))
|
π a.id, a.firstname ( ρ a ( Customer ) )
symbol | ρ |
---|---|
alternative syntax | rho |
example |
the old and the new column names in a list (see example) "←" can be substituted with "<-" pi x, b rho a->x {a, b
1, 2
3, 4
}
|
ρ myId←id, foobar←firstname (π id, firstname ( Customer ) )
symbol | τ |
---|---|
alternative syntax | tau |
example |
tau a asc, b desc ( R )
|
τ [1], firstname desc (π id, firstname ( Customer ) )
symbol | γ |
---|---|
alternative syntax | gamma |
example |
gamma a, count(*)->x ( R )
|
γ a, b ; sum(c)->x ( Customer )
If no grouping columns are provided the entire relation is the group.
number | string | date | |
---|---|---|---|
COUNT( * ) | yes | yes | yes |
COUNT( column ) | yes | yes | yes |
MIN( column ) | yes | yes | yes |
MAX( column ) | yes | yes | yes |
SUM( column ) | yes | no | no |
AVG( column ) | yes | no | no |
( CHILD_EXPRESSION ) FUNCTION ARGUMENT (
CHILD_EXPRESSION )
symbol | ∩ |
---|---|
alternative syntax | intersect |
( Customer ) ∩ ( Customer )
symbol | ∪ |
---|---|
alternative syntax | union |
( Customer ) ∪ ( Customer )
symbol | ÷ |
---|---|
alternative syntax | / |
( Customer ) ÷ ( Customer )
- | ∪ |
---|---|
alternative syntax | \ except |
( pi firstname ( Customer ) ) - ( rho test<-lastname (
pi lastname ( Customer )
) )
symbol | ⨯ |
---|---|
alternative syntax | cross join |
symbol | ⋈ |
---|---|
alternative syntax | join inner join |
Special case:
The name of a single boolean column (like R join a S
) can not be used directly
as a join condition due to ambiguities in the relational algebra syntax.
The column must either be specified with its qualifier (R join R.a S
) or wrapped in
parentheses (R join (a) S
).
This is not necessary for more complex boolean expressions. The problem is only that the single
column name
can not be distinguished from a relation name: the expression X=R join S X
could be
interpreted as A=(R join S A)
instead of A=(R join S) A
.
symbol | ⋈ |
---|---|
alternative syntax | join natural join |
ρ a ( Customer )
⋈ a.name < b.name ( ρ b ( Customer ) )
symbol | ⟕ |
---|---|
alternative syntax | left outer join left join |
symbol | ⟖ |
---|---|
alternative syntax | right outer join right join |
symbol | ⟗ |
---|---|
alternative syntax | full outer join |
symbol | ⋉ |
---|---|
alternative syntax | left semi join |
symbol | ⋊ |
---|---|
alternative syntax | right semi join |
symbol | ▷ |
---|---|
alternative syntax | anti semi join anti join |
The operator precedence allows to obmit most of braces.
The used precedence is shown in the table below.
All operators are left associative.
order of precedence | Operator |
---|---|
0 | relation-name, inline-relation |
1 | projection, selection, rename (columns), rename (relations), group, order by |
2 | cross product, theta join, natural join, left outer join, right outer join, full outer join, left semi-join, right semi-join, anti semi-join, division |
3 | intersection |
4 | union, subtraction |
A join B x C
((A join B) x C)
sigma true A join sigma true B
(sigma true (A)) join (sigma true (B))
rownum()
expression always represents the index of the lefthand relation
syntax | returns type | description |
---|---|---|
a AND b |
boolean | logical and |
a OR b |
boolean | logical or |
a XOR b |
boolean | logical exclusive or |
NOT b |
boolean | logical not |
a = b
a != b
a < b
a > b
a <= b
a >= b
a != b
|
boolean | compares two values of the same type |
a:string LIKE 'PATTERN' |
boolean | returns true if expression evaluating to a string a matches the pattern
given as the second operand.
The pattern has to be given as a string literal;
An underscore ( |
a:string ILIKE 'PATTERN' |
boolean |
same as LIKE but matches case-insensitive.
This is not in the SQL standard but is a PostgreSQL extension. |
a + b
a - b
a * b
a / b
a % b
|
number | arithmetic addition, subtraction, multiplication, division, modulo |
rand() |
number | returns a random number in the interval [0, 1] |
rownum() |
number |
returns the index of the current row (starting with 0).
If the function is used in a binary relational algebra expression (e.g. a join) it always represents the index of the left operand. |
length(a:string) |
number | returns the length of the string |
date(a:string) |
date | parses the given string to a date object. The string must be in the form YYYY-MM-DD |
adddate(a:date, b:number) |
date | adds the given number of days to date a |
subdate(a:date, b:number) |
date | subtracts the given number of days from date a |
now()
transaction_timestamp()
statement_timestamp() |
date | returns a timestamp representing the start of the query execution
transaction- and statement start are the very same value due to the lack of a transaction concept |
clock_timestamp() |
date | returns the current timestamp while executing the query |
year(a:date) |
number | returns the year component of a given date |
month(a:date) |
number | returns the month component of a given date as a number. (1 = Sunday, 2 = Monday, ..., 7 = Saturday) |
day(a:date)
dayofmonth(a:date) |
number | returns the day component of a given date as a number in the range 1 to 31 |
hour(a:date) |
number | returns the hour component of a given date as a number in the range 0 to 23 |
minute(a:date) |
number | returns the minute component of a given date as a number in the range 0 to 59 |
second(a:date) |
number | returns the second component of a given date as a number in the range 0 to 59 |
concat(a:string [, ...]) |
string | concatenates the given strings |
upper(a:string)
ucase(a:string) |
string | converts the given string to upper-case |
lower(a:string)
lcase(a:string) |
string | converts the given string to lower-case |
strlen(a:string) |
number | number of characters of the string |
abs(a:number) |
number | the absolute value of the given number |
add(a, b)
sub(a, b)
mul(a, b)
div(a, b)
mod(a, b)
|
number | arithmetic addition, subtraction, multiplication, division or modulo of the given numbers |
round(a)
floor(a)
ceil(a)
|
number | round to nearest integer, largest integer not greater than the argument or smallest integer not less than the argument |
coalesce(value [, ...]) |
type of value | returns the first of its arguments that is not null or null if all arguments are null. Note that all arguments must have the same datatype. |
CASE WHEN condition THEN result
[WHEN ...]
[ELSE result]
END
|
type of result | returns the first result where the condition evaluates to true. If all conditions are false the else part is executed or null is returnt if the else part is missing. Note that all results must have the same datatype. |
order of precedence | Operators |
---|---|
0 | functions, constants, columns |
1 | ! (boolean not) |
2 | - (unary minus) |
3 | *, /, % |
4 | -, + |
5 | = (comparison), >=, >, <=, <, <>, !=, LIKE, ILIKE |
6 | CASE, WHEN, THEN, ELSE |
7 | AND |
8 | XOR |
9 | OR, || |
The goal of the SQL mode of the relational algebra calculator is to provide a translation from SQL to relational algebra to show how they are related. It does not support all features a real database system like PostgreSQL or MySQL does because the goal is to provide a translation into relational algebra. This means that many features like correlated-substatements are not supported because the translation into relational algebra is not trivial and modern database systems use an extended set of operators internally that do not require a one-to-one translation into "classical" relational algebra. Therefore the learning effect for users of this tool would not be that big.
All keywords are case insensitv.
The following Synopsis is a adapted version of PostgreSQL and shows the general syntax of the supported SQL. Brackets indicate optional parts. Braces and vertical lines indicate that one of the alternatives has to be chosen. Dots mean that the preceding element can be repeated.
[ WITH with_query [, ...] ]
SELECT [ DISTINCT ]
* | expression [ [ AS ] output_name ] [, ...]
FROM from_item [, ...]
[ WHERE condition ]
[ GROUP BY column [, ...] ]
[ HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY column [ ASC | DESC ] [, ...] ]
[ LIMIT { count | ALL } ]
[ OFFSET start [ ROW | ROWS ] ]
[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
where from_item can be one of:
table_name [ AS alias ]
with_query_name [ AS alias ]
( select ) AS alias
from_item CROSS JOIN from_item
from_item NATURAL JOIN from_item
from_item [ INNER ] JOIN from_item ON join_condition
from_item [ INNER ] JOIN from_item NATURAL
from_item [ INNER ] JOIN from_item USING ( join_column [, ...] )
from_item
from_item { LEFT | RIGHT | FULL } [ OUTER ] JOIN ON join_condition
from_item
from_item { LEFT | RIGHT | FULL } [ OUTER ] JOIN NATURAL from_item
from_item { LEFT | RIGHT | FULL } [ OUTER ] JOIN USING ( join_column [, ...] ) from_item
and with_query is:
with_query_name AS ( select )
The SQL statement is translated directly into relational algebra. To understand some of the effects of the tool it might be helpful to understand the steps of the translation process. As mentioned before, real database systems might take a different (more complex) aproach but this should help to get an idea of how SQL could be translated into "classical" relational algebra.
The following list shows the translation from SQL into relational algebra starting with the inner most relational algebra expression at the top.
,
is a cross join)
The direct translation into relational algebra with implicit duplication elimination requires the distinct
keyword to be equivalent. A warning is shown if you omit it.
The select-clause is translated into up to two relalg operators.
select * ...
where no changes are made to the
schema of the result. Therefore no projection is needed. You can use the optional table-alias-prefix
if the columns of a single table/relation should be returned only: select distinct
R.* from R inner join S
select a, b from ...
) then
a single projection is used.
as foo
.
e.g. select a as foo from ...
The allowed expressions are the same as for the projection.
So it can be either the name of the column (with optional renaming using as
) or a complex
value expression. In the latter case a output-name must be given using
as
because the tool requires every column to have a name.
In its simplest form the from-clause holds a single table/relation name that is used directly in the
relalg statement. If the optional table-alias is specified with table as foo
this is
reflected by wrapping the relation in a rename-relation
operator with the given output-name.
select distinct x.a, x.b
from R as x
Joins can be expressed using the ANSI join syntax
select distinct *
from A, B
inner join C on ...
inner join D natural
natural join E
left outer join F on ...
left outer join G natural
right outer join H on ...
right outer join I natural
full outer join J on ...
full outer join K natural
where ...
The comma is part of the old join syntax and is translated into a cross join.
select distinct *
from R, S as s, T
where s.a = R.a
Instead of the name of relation a non-correlated substatement can be used. A table alias must be provided
with (...) as foo
.
A non-correlated substatement can be directly translated into relational algebra by just translate
the sub-statement into relational algebra and use the resulting operator tree instead of the relation.
Non-correlated means that it must not reference/use any columns of tables defined in the outer scope.
This limitation is intentionally because the translation into relational algebra is not trivial and
modern database systems use an extended set of operators internally that do not require a one-to-one
translation into "classical" relational algebra. Therefore the learning effect for users of this tool
would not be that big.
select distinct *
from R, (select * from S where a > 0) as x
where x.a = R.a
The boolean condition in the where-clause can be any expression evaluating to boolean.
The where clause is directly translated to an relational algebra selection with the very same condition. This selection is applied after joining relations of the from-clause therefore has to use the original column names.
Subquery Expressions like EXISTS
, IN
, ANY/SOME
or ALL
are not supported because their translation into relational algebra is not trivial and
modern database systems use an extended set of operators internally that do not require a one-to-one
translation into "classical" relational algebra. Therefore the learning effect for users of this tool
would not be that big.
The GROUP-BY-clause takes a list of column names only argument.
The GROUP-BY-clause is directly translated to the relational algebra group-by operation and is executed directly after the selection built from the WHERE-clause and before the projection/renaming build from the SELECT-clause. Therefore the column names that can be used are the ones available after all joining all tables.
The aggregations used in the relational algebra group-by
operation are taken from the SELECT-clause and an output-name must be given using as
because the tool requires every column to have a name.
If no aggregations are present in the SELECT-clause a projection is used instead of the group-by operation because sigma without aggregation has the very same effect.
Every non-aggregation-column in the SELECT-clause must be present in the group by clause because the would not be available after the grouping.
The HAVING-Clause represents an optional relational algebra selection. The boolean condition can be any expression evaluating to boolean.
The resulting selection is executed directly after the relational algebra group-by operation.
Unlike PostgreSQL the HAVING-clause is only allowed when either a aggregation or grouping is present.
Order by takes a list of column names or indices of columns (starting with 1) as its argument.
It is directly translated to the extended relational algebra operation order by (tau).
The LIMIT-clause can be either specified with the LIMIT-OFFSET syntax used by PostgreSQL and MySQL or the FETCH-FIRST syntax introduced in SQL:2008.
It is translated into a relational algebra selection using the
rownum()
-function to limit the number of rows returned.
The Set-Operators UNION, INTERSECT and EXCEPT directly map to the relational algebra operators union, intersection and subtraction.
The keyword DISTINCT
is optional because it represents the default behavior. The keyword
ALL
is ignored and a warning is shown because the targeted relational algebra has a
implicit elimination duplicate rows.
Parentheses can be used to create more complex statements:
(
select distinct * from S
union
select distinct * from T
)
except select distinct * from T
order by 1
limit 1
The WITH-clause (also known as common table expressions) provides a way to define subqueries for single or multiple use in a statement. This can be thought as defining a temporary table for that query in SQL terminology or creating variables in relational algebra. Recursive evaluation is not supported.
Each subquery can be referenced by the name from the WITH-clause. The subquery is automatically renamed to the name used in the WITH-clause.
Value expressions are used for boolean expressions for WHERE- and HAVING-clause, the boolean conditions of joins and calculated values in the SELECT-clause. The type of a expression is either string, number, date or boolean and is determined by the used operations and columns.
The supported functions and operations are the same for SQL and relational algebra: value expression
This work by
Johannes Kessler
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.