A Complete SQL Tutorial: Great for Data Analysts
Learning SQL sometimes is a bit hard, especially if you starting off and you’re not too sure what exactly it does. Now, if you want to career in data analytics or data science or even a career that somehow deals with databases, you are going to have to know sql. In this post. I’m going to go over some foundational basics of sql.
We’re going to figure out why should you use sql? What is sql? When should you use it? Even No SQL versus sql. And we’re also going to be building some SQL queries.
Introduction to SQL
SQL stands for Structured Query Language and was invented in the 1970s- but you don’t need to know that — do you know why? Because not once in my career did that ever help me, in any way. SQL pronounce Sequel is a language thats designed to talk to databases. Whenever you think SQL, think databases.
And when I say databases, I mean storage for data — or tables. You can store your data with different providers, such as SQL Server, Postgres, Amazon Redshift/Aurora, Snowflake, Google BigQuery, Azure. They all use SQL or some version of SQL.
So let’s take it simply and I’ll explain to you why we need SQL. I have a table that has 10 million rows and a few columns. It’s a table that list’s all my online activity ever. Every time I searched something, everytime I googled something, every message I sent, every profile I viewed etc.
So this is my table that’s stored in some cloud database and there’s 10 million rows. What If I wanted to find out how much time I spent online in September?
I would need to ask the data and it would be something like this:
Q: Hey, data. Can you tell me how much time I spent online in September?
A: You spent 1.2million seconds in September.
And then I say,
Q: Okay data Can you tell me how much time I spent online in september in Hours not seconds?
A: You spent 360 hours online in September.
Now those questions that I’ve asked the data, were queries. And that would be how queries are created.
Here you can see that it gave me answers, without me doing to much of work. I just asked it a few sentences. That’s how queries are.
The Other Uses of SQL
Now, we don’t only need to ask the table questions. SQL can do other things and let’s do a deep dive of it.
DQL
The most common thing SQL does is Query. This means it fetches data from your database and any other conditions that you wanted/
So me saying:
Q: Hey, data. Can you tell me how much time I spent online in September?
Means I just want to see the time and to only show data in September. When we fetch data from a database, this is known as Data Query Language (DQL). And it only uses one command called select.
Out of interest, it will look like this:
select sum(time_spent_seconds)
from myonlineactivity
where monthyear = 'september2022'
That’s DQL. But there are other uses of SQL, not just querying things.
DDL
I could do things to the actual tables. Maybe I want to create a table- where I want to have a list of all my devices I’ve used.
I would use the CREATE command
CREATE TABLE devices (
DeviceID int,
Name varchar(255),
Owner varchar(255),
);
Or maybe I want to to remove tables or even databases, I would use the DROP command:
Drop database activity
Or maybe I want to add a column that’s a varchar, I would use the ALTER command
Alter table devices add storebought varchar;
Or maybe I want to empty out a table and start again, you can use the TRUNCATE command
TRUNCATE table devices;
This is known as DDL. It stands fro Data Definition Language and helps you to define the database structure or schema. The 4 common commands are CREATE, DROP, ALTER< TRUNCATE
DML
Now, what if I wanted to Insert records or rows? Maybe I wanted to add one more device to my Table. I would use the insert a record like this
INSERT INTO devices (DeviceID, Name, Owner) VALUES ('4', 'Airpods', 'Dee');
Maybe I made a mistake, maybe I need to update DeviceID number 4 to change the owner to Jon.
I can use the UPDATE statement:
UPDATE devices
SET Owner = 'Jon'
WHERE DeviceID = 4;
and if I ever need to delete a row or some rows, I can use the DELETE command. So maybe I just want to show my devices and not Jon’s devices, so I would do this
DELETE FROM devices
WHERE Owner = 'Jon';
Other ones
Now there are other commands such as the ability to Grant or Revoke certain permissions to specific users — these are DCL commands which are useful to give “rights & permissions.”
Transaction control language or TCL commands deal with the transaction within the database. Can include using COMMIT to save all the transactions to the database or ROLLBACK to transactions that have not already been saved to the database.
But as a data analyst the common ones would be DQL and DDL and DML. So where you can query the data and maybe alter the data.
Why is SQL Beneficial
- SQL is high performance.
SQL’s performance is good even with heavy workloads and high usage. It’s proven to work well with complex databases. SQL can support a lot of data and handle tons of transactions at the same time.
- SQL is accessible.
SQL is compatible with most databases, such as Microsoft SQL Server, Oracle Database, MySQL, Azure, BigQuery, Amazon Redshift, Snowflake. Most relational database management systems have SQL support, and they come with additional features as well. You can also use SQL to create application extensions for procedural programming.
- SQL is highly scalable.
It’s incredibly easy to create new tables and carry over existing tables to new databases. They can be easily dropped or created at your convenience.
- SQL has great security features.
Permissions are easy to manage with SQL. It’s very simple to provide and verify your permissions, ensuring the security of your data.
- Requires little coding knowledge.
If you just want to retrieve and analyze data, it’s easy to learn. All you need to do is learn the select, insert, update, and into functions and their syntax.
Relational Databases
Now I wanted to quickly touch on relational databases. SQL is a language used to communicate with data stored in a relational database management system (RDMS). This is just a fancy way of saying a database that stores a bunch of tables and you can link these tables to each other. The link is referred as the relationship.
Customer Table
Order Table
Relational database systems use a model that organizes data into tables of rows (also called records) and columns (also called attributes or fields).
Which basically means it uses a table.
These tables can be linked or related using keys. Each row in a table is identified using a unique key, called a primary key.
So in this example, the Customer Table has a primary key called CustomerID. Why? Because CustomerID is unique, each row in the customer table represents 1 unique customer so each CustomerID represents 1 unique customer.
This primary key can be added to another table, becoming a foreign key. So let’s look at the Order Table. The Primary key here is OrderID. Because Each row in the order table represents a unique order — so the OrderID column will be a primary key — it will be unique. But the CustomerID is a foreign key — because we can have the same customer ordering more than once, so it won’t be unique in the Order Table. You can see here that CU001 bought a Notebook and a Pencil to separate times.
The primary/foreign key relationship forms the basis of the way relational databases work.
Now you get something like NoSQL
Let’s talk about NoSQL databases. NoSQL Database is a non-relational Data Management System, that does not require a fixed schema. Meaning to keys, it’s not in a tabular structure. It stores data in JSON format. Whenever you think of JSON format think this.
{
"name" : "Dee",
"occupation" : "Analyst",
"location" : "South Africa",
"hobbies": ["fine dining", "drawing", "breaking python scripts"]
}
So when you retrieve something from a sql database it looks like this:
But when you retrieve something from a NO database it looks like this:
{
"CustomerID" : "CU001",
"FirstName" : "Jane",
"LastName" : "Doe",
"City" : "Paris",
"Age" : 31
}
{
"CustomerID" : "CU002",
"FirstName" : "Sam",
"LastName" : "Smith",
"City" : "Berlin",
"Age" : 21
}
Benefits of NoSQL — I think you can see it here is it’s flexible especially if you don’t have data that is related. It’s also good for huge volumes of data.
If you’ve heard of MongoDB — MongoDB is NOSQL, DynamoDB as well and there are others.
In terms of popularity and which one is more used — I think SQL is still king here in terms of data analytics and I don’t think that would go away soon but also NoSQL usage has been growing and has it’s place in the market.
SQL Tutorial — Building a SQL Query
So I am going to head over to w3schools sql editor and show you some sql queries to get you started off. This is just going to be a simple tutorial to cover the basics on building a SQL Query.
https://www.w3schools.com/sql/trysql.asp?filename=trysql_asc
So in this editor we have some tables in our database. And usually when you are in a SQL editor like mySQL workbench, Snowflake etc so you see a list of tables available to you. Now usually these tables will be grouped — so in one group you can have raw tables that need cleaning, in another group you can have your cleaned tables just for reporting. Those groups are called schemas.
Now what if we want to view a table?
Select Statement
We use the select statement.
select *
from orders
This statement shows all the columns and depending on your editor the first x amount of rows
Now what if you don’t want to see all the columns? Maybe we just want to see OrderID and OrderDate
select OrderID,
OrderDate
from orders
You separate fields by commas, except the last field just before the FROM command.
select OrderID,
OrderDate,
ShipperID
from orders
Where Statement
Now, let’s go to the customers table
select * from customers
And let’s see what customers are in london
select *
from customers
where City = "London"
Now remember, letters or characters and dates get quotation marks, numbers don’t need quotes.
You can also find specific phrases that occur such as any city that has the letters wa in it. The percentages on either end means that any combination of words or integers can happen on the left and right
select *
from customers
where City like "%wa%"
If we remove the first percent sign, it only looks for cities that start with wa
select *
from customers
where City like "wa%"
likewise, removing the last % sign will look for cities that end with wa
select *
from customers
where City like "%wa"
We can also do multiple where statement
select *
from orders
where orderdate >= "1996-09-10" and shipperid = 2 and employeeid = 4
Joining: The Theory
Now because this is a relational database we can join tables to get more information. Such as taking the orders table and joining customers to it. Now I’m just going to do a bit of theory to help you understand sql joins
Let’s say you have two tables. One is an order table- which shows the orders a customer has made. The other is a product table- which gives you details of the products. And you want to join these two.
Table A: Order Data
Table B: Product Data
There are 4 main joins you can use in sql.
Firstly before we start, notice that we can link the two tables by product id as it’s present in both.
Inner Join
Inner join produces only the set of records that match in both Table A and Table B.
So this is what you’ll see on Tableau as your final table
Because P1, P2, P3 exist in both the tables so that will only pull through.
Left Join
Left join only selects data starting from the left table and matching rows in the right table.
P4 and P5 existed on the orders table but did not exist in the product table. So the final table has NULLs for product data.
P6 existed on your product table but did not exist on the orders table. The left join will bring in all rows from the left table and the matching rows from the right table. So since P6 isn’t in the orders table it won’t be brought in to your final table.
Right Join
The Right Join is the reverse of left. Only selects data starting from the right table and matching rows in the left table.
Right join selects data starting from the right table and matching rows in the left table.
P6 existed on the product table but did not exist in the orders table. So the final table has NULLs for orders data.
P4 and P5 existed on the order table but did not exist on the product table. The right join will bring in all rows from the right table and the matching rows from the left table. So since P4 and P5 isn’t in the product table it won’t be brought in to your final table.
Full Outer
A full outer join has a final set that contains all rows from both left and right tables, with the matching rows from both sides where available. In case there is no match, the missing side will have NULL values.
Now full outers aren’t always available depending on the database.
Joining: The Practice
Now let’s have a look at orders
select *
from orders
And with that let’s bring in customer. Now when we’re joining, use your most granular table as the primary table. So we know orders will be the primary table because customerid is a foreign key — we can have more than one customerid showing up in the orders table. Because one customer can order multiple times. So since orderid occurs uniquely, this is our primary table.
select *
from orders
left join customers
on orders.CustomerID = customers.CustomerID
So a left join will bring in all orders but only show the customers that match with the orderid
Always after from introduce your left join. and then you need to decide how to link the two tables. Start with the table you referenced first and we’re telling the database from the orders table, use customerid and join it to the customerid in the customers table.
Now if we do an inner join, this will only bring in exact match if the customerid shows up in orders and customers.
select *
from orders
inner join customers
on orders.CustomerID = customers.CustomerID
A right join won’t make sense, because if our customer table becomes our primary table we are saying every customer in the customer table has made an order. Which doesn’t make sense. We definitely know every order would have a customer.
You can implement a where statement but remember once you join two tables it good practice to tell sql which column you’re referencing belongs to which table especially if the column appears on both
select *
from orders
inner join customers
on orders.CustomerID = 65 = customers.CustomerID
where orders.CustomerID = 65
you can get away with something like employee id as it only show up in the orders table.
select *
from orders
inner join customers
on orders.CustomerID = 65 = customers.CustomerID
where EmployeeID = 8
You can add multiple joins
select *
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
Aggregating
You can also perform calculations. An aggregate function in SQL performs a calculation on multiple values and returns a single value. You can also do count, average, min, max
select sum(quantity) as quantitysum
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
You can also do avg
select avg(quantity) as quantityavg
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
min/max
select min(quantity) as quantitymin
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
count
select min(quantity) as quantitymin
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
count distinct
select count (distinct orders.customerid) as quantitymin
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
Group by
group by
If you want the quantity per customer
select orders.customerid, sum(quantity) as quantitysum
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
group by orders.customerid
or quantity per customer and employee
select orders.customerid, orders.employeeid, sum(quantity) as quantitysum
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
group by orders.customerid, orders.employeeid
Order By
Using order by will order your table based on a column
select orders.customerid, orders.employeeid, sum(quantity) as quantitysum
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
group by orders.customerid, orders.employeeid
order by orders.customerid desc
Limit
Limit will take the first x
select orders.customerid, orders.employeeid, sum(quantity) as quantitysum
from orders
left join customers
on orders.CustomerID = customers.CustomerID
left join orderdetails
on orders.orderid = orderdetails.orderid
group by orders.customerid, orders.employeeid
order by orders.customerid desc
limit 3
Breaking down the Body of a SQL Query
So in a sql query you have
- Essential part defines the result and what data you are pulling from,
- Optional part allows you to change the data by filtering, aggregating, sorting or limiting your result
- Select Everything after the select keyword represents the output of your query. It’s the end result.
- From The data you’re manipulating. It can be a table or the result of another query, usually called sub-query or inner query.
- Where Used to filter the data and to specify the set of conditions the data needs to comply to, in order for you to produce an output.
- Group by Indicates you’re aggregating the data, e.g., counting or summing values.
- Order by How you want to sort the results.
- Limit Indicates you only want to display a certain number of results.
Conclusion
That’s a brief overview of SQL and how to use it. I hope this helps. If you enjoy content like this, follow my YouTube channel