In the big data era, it is quite common to work with DataFrames that consist of hundreds or even thousands of columns. In such cases, even printing them out can be tricky, since you need to ensure the data is presented in a clear and efficient way.
In this article, I explore three basic ways to display a PySpark DataFrame in a tabular format. For each one, I also discuss when to use or avoid it, depending on the shape of the data you have to deal with.
Print a PySpark DataFrame
To get started, let’s consider the minimal PySpark DataFrame below as an example:
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

spark_df = spark.createDataFrame(
    [
        (1, "Mark", "Brown"),
        (2, "Tom", "Anderson"),
        (3, "Joshua", "Peterson"),
    ],
    ('id', 'firstName', 'lastName')
)
The most obvious way to print a PySpark DataFrame is the show() method:
>>> spark_df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
By default, only the first 20 rows are printed. If you want to display more rows than that, you can simply pass the argument n, e.g. show(n=100).
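show() also accepts a truncate argument, which controls whether long cell values are cut off (by default, strings longer than 20 characters are truncated). For example, to display up to 100 rows with full cell values:

# Display up to 100 rows and keep long cell values intact
spark_df.show(n=100, truncate=False)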
Print a PySpark DataFrame vertically
Now let’s consider another example in which our DataFrame has a lot of columns:
spark_df = spark.createDataFrame(
    [
        (
            1, 'Mark', 'Brown', 25, 'student', 'E15',
            'London', None, 'United Kingdom'
        ),
        (
            2, 'Tom', 'Anderson', 30, 'Developer', 'SW1',
            'London', None, 'United Kingdom'
        ),
        (
            3, 'Joshua', 'Peterson', 33, 'Social Media Manager',
            '99501', 'Juneau', 'Alaska', 'USA'
        ),
    ],
    (
        'id', 'firstName', 'lastName', 'age', 'occupation',
        'postcode', 'city', 'state', 'country',
    )
)
Now if we attempt to print the DataFrame, the output (depending on the size of your screen) could be quite messy:
>>> spark_df.show()
+---+---------+--------+---+--------------------+--------+------+------+--------------+
| id|firstName|lastName|age| occupation|postcode| city| state| country|
+---+---------+--------+---+--------------------+--------+------+------+--------------+
| 1| Mark| Brown| 25| student| E15|London| null|United Kingdom|
| 2| Tom|Anderson| 30| Developer| SW1|London| null|United Kingdom|
| 3| Joshua|Peterson| 33|Social Media Manager| 99501|Juneau|Alaska| USA|
+---+---------+--------+---+--------------------+--------+------+------+--------------+
When working with real-world data this behaviour is frequent, given the size of the DataFrames involved, so we need a way to display our data that remains readable. You should also avoid making assumptions about the screen size of other users (e.g. when you want to include this output in the logs); the results should always be consistent and presented in the same way to everyone.
A typical workaround is to print the DataFrame vertically. To do so, we need to pass vertical=True to the show() method:
>>> spark_df.show(vertical=True)
-RECORD 0--------------------------
id | 1
firstName | Mark
lastName | Brown
age | 25
occupation | student
postcode | E15
city | London
state | null
country | United Kingdom
-RECORD 1--------------------------
id | 2
firstName | Tom
lastName | Anderson
age | 30
occupation | Developer
postcode | SW1
city | London
state | null
country | United Kingdom
-RECORD 2--------------------------
id | 3
firstName | Joshua
lastName | Peterson
age | 33
occupation | Social Media Manager
postcode | 99501
city | Juneau
state | Alaska
country | USA
Even though the output above is not in table format, it is sometimes the only way to display your data in a consistent and readable manner.
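The vertical layout can be combined with the other show() arguments. For instance, a quick way to inspect a single record with nothing truncated:

# Print just the first record vertically, keeping long values intact
spark_df.show(n=1, vertical=True, truncate=False)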
Convert a PySpark DataFrame into pandas
A third option is to convert your PySpark DataFrame into a pandas DataFrame and print that instead (shown here with the three-column DataFrame from the first example):
>>> pandas_df = spark_df.toPandas()
>>> print(pandas_df)
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
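Keep in mind that pandas applies its own truncation when printing wide frames, so for a many-column DataFrame you may also need to relax the pandas display options, for example:

import pandas as pd

# Show all columns and let pandas use the full terminal width
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)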
Note that this approach is not recommended when you have to deal with fairly large DataFrames, as pandas needs to load all the data into the driver’s memory. If this is the case, the following configuration will speed up the conversion of a large Spark DataFrame to a pandas one:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
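Even with Arrow enabled, all the data still ends up on the driver, so if you only need a glimpse of a large DataFrame it is safer to limit the rows before converting. A minimal sketch (the 1,000-row cut-off is an arbitrary choice):

# Collect only the first 1,000 rows to the driver before converting
sample_pdf = spark_df.limit(1000).toPandas()
print(sample_pdf)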
For more details on PyArrow optimizations when converting between Spark and pandas DataFrames, you can refer to my Medium article below:
Speeding up the conversion between PySpark and Pandas DataFrames
Conclusion
In this article, we explored a pretty basic operation in PySpark. In most cases, printing a PySpark DataFrame vertically is the way to go, since the object is typically too large to fit into a table format. It is also safer to assume that most users don’t have screens wide enough to fit large DataFrames in tables. However, when you are dealing with some toy data, the built-in show() method with the default arguments will do the trick.