Polars DataFrame: Introduction to High-Speed Data Processing
Stepping beyond the familiar realms of Pandas, Python enthusiasts are now embracing the high-speed, efficient data processing capabilities of Polars DataFrame. Offering a formidable toolkit to manage large datasets, this DataFrame library, entirely built in Rust, is gaining traction among Data Scientists and Analysts. This comprehensive guide dives into the Polars DataFrame, imparting a deeper understanding of its functionalities and showcasing how it stands as a superior alternative to Pandas.
Harnessing Polars Over Pandas
A quick comparison reveals the distinct advantages of Polars over Pandas. Unlike Pandas, Polars does not use an index for the DataFrame, which makes data manipulation substantially simpler. Polars also uses Apache Arrow arrays for its internal data representation, which improves load times, memory usage, and computational efficiency. Because it is written in Rust, Polars can run more operations in parallel, accelerating many tasks. Finally, Polars supports lazy evaluation: a query is optimized as a whole before it runs and only the data it needs is processed, which reduces memory usage. Pandas evaluates eagerly and offers no equivalent.
Getting Started with Polars DataFrame
Installing Polars is straightforward. You can use pip or conda commands:
pip install polars
conda install polars
Let's start our journey by creating a Polars DataFrame. Below, we are importing the Polars module and crafting a DataFrame:
import polars as pl
df = pl.DataFrame(
{
'Model': ['iPhone X','iPhone XS','iPhone 12','iPhone 13','Samsung S11','Samsung S12','Mi A1','Mi A2'],
'Sales': [80,170,130,205,400,30,14,8],
'Company': ['Apple','Apple','Apple','Apple','Samsung','Samsung','Xiao Mi','Xiao Mi'],
}
)
df
Unlike Pandas, Polars expects the column header names to be string types. If you want to use integers as column header names, make sure to use them as strings:
df2 = pl.DataFrame(
{
"0" : [1,2,3],
"1" : [80,170,130],
}
)
When a DataFrame is displayed, each column's data type is shown alongside the header name. If you want to retrieve the data type of each column explicitly, use the dtypes property:
df.dtypes
Retrieving the column names can be achieved with the columns property:
df.columns # Returns ['Model', 'Sales', 'Company']
To get the content of the DataFrame as a list of tuples, use the rows() method:
df.rows()
One crucial feature to note is Polars doesn't use the concept of index, unlike Pandas. The design philosophy of Polars explicitly states that index is not particularly useful in DataFrames.
Unraveling Column Selection in Polars
Selecting columns in Polars is effortless. Specify the column name using the select() method:
df.select('Model')
The statement returns a Polars DataFrame containing the 'Model' column. However, Polars discourages the square bracket indexing method, and its future versions might even eliminate this feature. To select multiple columns, provide the column names as a list:
df.select(['Model','Company'])
The power of expressions is another major feature in Polars. For example, to retrieve all the integer (specifically Int64) columns in the DataFrame, you can use an expression within the select() method:
df.select(pl.col(pl.Int64))
Polars has a unique way of chaining together expressions. For instance, the below expression selects the 'Model' and 'Sales' columns and then sorts the rows based on the values in the 'Sales' column:
df.select(pl.col(['Model','Sales']).sort_by('Sales'))
If you want to retrieve all the string-type columns, use the pl.Utf8 data type:
df.select([pl.col(pl.Utf8)])
Expressions in Polars will be further explained in the next part of the article.
Discovering Row Selection in Polars
To select a single row in a DataFrame, pass in the row number using the row() method:
df.row(0) # get the first row
This results in a tuple:
('iPhone X', 80, 'Apple')
For selecting multiple rows, Polars recommends using the filter() method. For instance, if you wish to retrieve all Apple products, you can use the following expression:
df.filter(pl.col('Company') == 'Apple')
You can specify multiple conditions using logical operators:
df.filter((pl.col('Company') == 'Apple') | (pl.col('Company') == 'Samsung'))
In Polars, you can use the following logical operators:
- | : OR
- & : AND
- ~ : NOT
Selecting Rows and Columns Simultaneously
Often, you'll want to select rows and columns simultaneously. Chain the filter() and select() methods to achieve this:
df.filter(pl.col('Company') == 'Apple').select('Model')
The above statement selects all rows containing 'Apple' and then only displays the 'Model' column. To also display the 'Sales' column, pass in a list to the select() method:
df.filter(pl.col('Company') == 'Apple').select(['Model','Sales'])
The crux of the story is that Polars DataFrame provides an efficient and high-speed alternative to traditional Pandas, leveraging its lazy evaluation, no-index policy, and parallel operation capabilities. From easy installations to complex data manipulations, Polars comes across as a powerful tool, simplifying data handling and improving memory usage.
Visualize Your Polars Dataframe with PyGWalker
PyGWalker is an open-source Python library that can help you create data visualizations from your Polars DataFrame with ease.
There is no need for complicated processing code: simply import your data, then drag and drop variables to create all kinds of data visualizations.
Here's how to use PyGWalker in your Jupyter Notebook:
pip install pygwalker
import pygwalker as pyg
gwalker = pyg.walk(df)
PyGWalker is built on the support of our open-source community. Don't forget to check out the PyGWalker GitHub repository and give us a star!
Frequently Asked Questions
- What are some major advantages of Polars over Pandas?
  Polars DataFrame offers several benefits over Pandas. It uses Apache Arrow arrays for efficient data handling, doesn't rely on indices for data manipulation, supports parallel operations, and employs lazy evaluation to optimize queries based on requirements, improving memory usage.
- How can I select columns in Polars DataFrame?
  Polars DataFrame provides the select() method to choose columns. You can pass the column name as a string to select a single column or a list of column names to select multiple columns.
- How can I filter rows based on specific conditions in Polars?
  The filter() method is used to select rows based on specific conditions. You can pass an expression to this method that compares a column to a certain value to filter rows. You can also use logical operators to specify multiple conditions.