Question
We need to remove the duplicate rows of data from our database.
Data
For this exercise, I have created dummy data. The SQL file can be found here.
Solution
Let’s first explore our data
SELECT *
FROM Employee;
Employee_ID | FirstName | LastName | Phone | |
1 | Adam | Owens | 1112223334 | adamowens@gmail.com |
2 | Michelle | Wilis | 2223334445 | michellewilis@gmail.com |
3 | Natasha | Lee | 3334445556 | natashalee@gmail.com |
4 | Riley | Jones | 4445556667 | rileyjones@gmail.com |
5 | Adam | Owens | 1112223334 | adamowens@gmail.com |
6 | Natasha | Lee | 3334445556 | natashalee@gmail.com |
We can see that there are two rows (ID = 1 & 5, 3 & 6) with the same data in all the columns except Employee ID. Let’s figure out a way to remove those rows. We can break the query into two parts.
- Write a query that identifies the duplicate rows by grouping the duplicating columns.
- We will use ROW_NUMBER() for this.
- We can then use the above query as a CTE to join our main table and remove the duplicate data.
Let’s write a query to determine the duplicate data
SELECT
*,
ROW_NUMBER() OVER( PARTITION BY FirstName, LastName, Phone, Email ORDER BY Employee_ID DESC) AS row_num
FROM Employee;
Employee_ID | FirstName | LastName | Phone | row_num | |
5 | Adam | Owens | 1112223334 | adamowens@gmail.com | 1 |
1 | Adam | Owens | 1112223334 | adamowens@gmail.com | 2 |
2 | Michelle | Wilis | 2223334445 | michellewilis@gmail.com | 1 |
6 | Natasha | Lee | 3334445556 | natashalee@gmail.com | 1 |
3 | Natasha | Lee | 3334445556 | natashalee@gmail.com | 2 |
4 | Riley | Jones | 4445556667 | rileyjones@gmail.com | 1 |
- ROW_NUMBER outputs 1 for the first unique value and then increments by 1 for a duplicate value.
- In our query, we partition the data by FirstName, LastName, Phone, and Email columns since those should tell us which row is duplicated.
- We then order our Employee_ID in descending order to get the latest ID at the top.
We can now simply use the above query as a CTE and remove the rows where row_num is greater than 1.
-- Let's remove the duplicate data
WITH CTE AS (
SELECT
*,
ROW_NUMBER() OVER( PARTITION BY FirstName, LastName, Phone, Email ORDER BY Employee_ID DESC) AS row_num
FROM Employee
)
DELETE FROM CTE
WHERE row_num > 1;
-- Let's view the data again
SELECT *
FROM Employee;
Employee_ID | FirstName | LastName | Phone | |
2 | Michelle | Wilis | 2223334445 | michellewilis@gmail.com |
4 | Riley | Jones | 4445556667 | rileyjones@gmail.com |
5 | Adam | Owens | 1112223334 | adamowens@gmail.com |
6 | Natasha | Lee | 3334445556 | natashalee@gmail.com |
Hopefully, this series helps you with the most common questions asked in a SQL Interview. Feel free to drop any questions you have encountered and don’t find an article in the series.
You can find the SQL file here.