Joins in PySpark: Let’s understand how to join multiple data using PySpark

Spread the love

In this article, I will go over all the various types of joins and we can implement them in PySpark. I won’t be going deep into what joins are, this article is more focused on the implementation of joins using PySpark.

Importing Data

The files used in this article can be downloaded from here.

BroadcastLogs_2018_Q3_M8.csv
ReferenceTables.zip
- LogIdentifier.csv

Let’s import our main CSV data:

logs = spark.read.csv(
    './data/broadcast_logs/BroadcastLogs_2018_Q3_m8.csv',
    sep='|',
    header=True,
    inferSchema=True,
    timestampFormat="yyyy-MM-dd"
)
>>> logs.printSchema()
root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttentionID: integer (nullable = true)
 |-- BroadcastOriginPointID: integer (nullable = true)
 |-- CompositionID: integer (nullable = true)
 |-- Producer1: string (nullable = true)
 |-- Producer2: string (nullable = true)
 |-- Language1: integer (nullable = true)
 |-- Language2: integer (nullable = true)

log_identifier = spark.read.csv(
    './data/broadcast_logs/ReferenceTables/LogIdentifier.csv',
    sep='|',
    header=True,
    inferSchema=True
)
>>> log_identifier.printSchema()
root
 |-- LogIdentifierID: string (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- PrimaryFG: integer (nullable = true)

Introduction to Joins

A join in PySpark is a lot similar to a join in SQL. There are different types of joins, namely

PySpark Join	Equivalent SQL Join
inner	INNER JOIN
outer, full, fullouter, full_outer	FULL OUTER JOIN
left, leftouter, left_outer	LEFT JOIN
right, rightouter, right_outer	RIGHT JOIN
cross
anti, leftanti, left_anti
semi, leftsemi, left_semi

The general syntax for a join is

[LEFT].join(
    [RIGHT],
    on=[PREDICATES],
    how=[METHOD]
)

Two tables, called a LEFT and a RIGHT table.
One or more PREDICATES, which are the series of conditions that determine how records between the two tables are joined.
A METHOD to indicate how we perform the join when the predicate succeeds and when it fails.

When we look at our two datasets, we can see that the two share a common key “LogServiceID”. We will use this column to join the two tables.

Inner Join

This is the most commonly used join type.
The output of this join is the matching rows from the two (left and right) tables.
Inner join is the default join in PySpark. So if you don’t specify any value for “how”, it will default to “inner”.
If a record from one table matches two records from the other table, in that case, the record will be duplicated.

# Option 1
>>> logs.join(
...     log_identifier,
...     on=logs['LogServiceID'] == log_identifier['LogServiceID'],
...     how='inner'
... ).printSchema()
root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttentionID: integer (nullable = true)
 |-- BroadcastOriginPointID: integer (nullable = true)
 |-- CompositionID: integer (nullable = true)
 |-- Producer1: string (nullable = true)
 |-- Producer2: string (nullable = true)
 |-- Language1: integer (nullable = true)
 |-- Language2: integer (nullable = true)
 |-- LogIdentifierID: string (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- PrimaryFG: integer (nullable = true)

# Option 2
>>> logs.join(
...     log_identifier,
...     on='LogServiceID',
...     how='inner'
... ).printSchema()
root
 |-- LogServiceID: integer (nullable = true)
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttentionID: integer (nullable = true)
 |-- BroadcastOriginPointID: integer (nullable = true)
 |-- CompositionID: integer (nullable = true)
 |-- Producer1: string (nullable = true)
 |-- Producer2: string (nullable = true)
 |-- Language1: integer (nullable = true)
 |-- Language2: integer (nullable = true)
 |-- LogIdentifierID: string (nullable = true)
 |-- PrimaryFG: integer (nullable = true)

Option 1
- Here we specify the columns which we want to use for joining the two tables.
- We can see that this causes the common column to appear twice in our final table.
- Since now we have two columns of the same name, when we select LogServiceID it will return an error. To resolve this, we can either
  - Select the column using the table name: logs.LogServiceID or logs[“LogServiceID”]
  - Use an alias for the tables and give them names “left” and “right” to make it easy for the end-user to drop the respective column.
Option 2
- If the column we are joining has the same name between the tables, we can simply specify the column instead of giving a condition.
- This causes the common column to appear only once in our final table.
- This type of join is also known as equijoin.
We can use multiple conditions in our “on” parameter using “&” (and) and “|” (or) logical operators.
- If there are multiple & operators, then you can simply add all the conditions in a list and PySpark will combine the conditions using & operator. Note, you can’t combine & and | in such a case.
The syntax for other join types is the same, only the parameter that changes is “how”.

Left Join

how = “left” or how = “left_outer” or how = “leftouter”
It will add the unmatched records from the left table in the joined table, filling the columns coming from the right table with null.
In practice, this means that your joined table is guaranteed to contain all the records from the left table.
It is very useful when you are not certain if the link table contains every key.

Right Join

how = “right” or how = “right_outer” or how = “righttouter”
It will add the unmatched records from the right table in the joined table, filling the columns coming from the left table with null.
In practice, this means that your joined table is guaranteed to contain all the records from the right table.
It is very useful when you are not certain if the link table contains every key.

Full Outer Join

how = “outer” or how = “full” or how = “full_outer” or how = “fullouter”
It is simply a fusion of left and right join. It will add the unmatched records from the left and the right table, padding with null.

Left Semi Join

how = “semi” or how = “left_semi” or how =”leftsemi”
It is similar to an inner join but keeps the columns in the left table. It also won’t duplicate the records in the left table if they fulfill the predicate with more than one record in the right table.
Its main purpose is to filter records from a table based on a predicate that is depending on another table.

Left Anti Join

how = “anti” or how = “left_anti” or how =”leftanti”
It is the opposite of the left semi join. It will keep only the records from the left table that do not match the predicate with any record in the right table.
If a record from the left table matches a record from the right table, it gets dropped from the join operation.

Cross Join

how = “cross”
It returns a record for every record pair, regardless of the value the predicates return.
If one table has 4 records and another has 5 records, then the resulting cross joined table will have 4×5=20 records.
They are useful when you want a table that contains every possible combination.

Hopefully, this article gave you a quick understanding of the different joins available in PySpark and how to implement them.

Spread the love

Joins in PySpark: Let’s understand how to join multiple data using PySpark

Importing Data

Introduction to Joins

Inner Join

Left Join

Right Join

Full Outer Join

Left Semi Join

Left Anti Join

Cross Join

Preet Parmar

Leave a Reply Cancel reply

SQL Simplified: Let’s go back to the basics

Handling customers with same name

“useState”: The most commonly used hook in React

Python DefaultDict – Ability to assign default values to your keys

How and when to use “decorators” in Python