Performance of pandas apply and NumPy vectorize
Both the pandas apply and NumPy vectorize functions are useful for manipulating a pandas DataFrame, but they have different use cases and performance characteristics.
The pandas apply function applies a built-in or custom function along an axis of the DataFrame. It is very flexible and can handle complex manipulations, such as calculations with conditional logic. However, because apply calls the function once per element, row, or column in Python, it can be slow on large DataFrames.
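As a hedged illustration of the conditional logic mentioned above, the sketch below applies a custom function to each row with axis=1. The small DataFrame and the larger_squared helper are hypothetical and only serve to show the pattern.

```python
import pandas as pd

# Hypothetical small DataFrame, for illustration only
small = pd.DataFrame({"col1": [0.2, 0.8, 0.5], "col2": [0.6, 0.1, 0.5]})

def larger_squared(row):
    # Square whichever of the two values in the row is larger
    if row["col1"] > row["col2"]:
        return row["col1"] ** 2
    return row["col2"] ** 2

# axis=1 passes each row to the function as a Series
small["result"] = small.apply(larger_squared, axis=1)
print(small)
```

Each row is handled by a separate Python function call, which is what makes this pattern flexible but slow on large inputs.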
The NumPy vectorize function can also be used on a pandas DataFrame, but it is designed for NumPy arrays with uniform data types. According to the NumPy documentation, vectorize is provided primarily for convenience rather than performance, since its implementation is essentially a for loop. It can be faster than pandas apply, but it usually does not perform as well as the built-in vectorized operations in pandas.
The following examples demonstrate how to use the pandas apply and NumPy vectorize functions and compare their performance for pandas DataFrame manipulations.
Create DataFrame
Create a random pandas DataFrame:
# import package
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(42)
df = pd.DataFrame({
'col1': np.random.rand(1000000),
'col2': np.random.rand(1000000)
})
# view first few rows of DataFrame
df.head(4)
col1 col2
0 0.374540 0.595156
1 0.950714 0.364717
2 0.731994 0.005376
3 0.598658 0.561088
We will use this DataFrame to perform manipulations using pandas apply, NumPy vectorize, and pandas vectorization.
pandas apply
We will use the pandas apply function to calculate the square of col2.
df['col3'] = df['col2'].apply(lambda x: x ** 2)
# view DataFrame
df.head(4)
col1 col2 col3
0 0.374540 0.595156 0.354210
1 0.950714 0.364717 0.133019
2 0.731994 0.005376 0.000029
3 0.598658 0.561088 0.314819
col3 contains the squared values of col2.
NumPy vectorize
Now, use the NumPy vectorize function to calculate the square of col2.
Here, we create a custom function that calculates the square and pass it to the vectorize function.
# import package
import numpy as np
# create a function
def square(a):
    return a ** 2
# NumPy vectorize
vectorized_sq = np.vectorize(square)
df['col4'] = vectorized_sq(df['col2'])
# view DataFrame
df.head(4)
col1 col2 col3 col4
0 0.374540 0.595156 0.354210 0.354210
1 0.950714 0.364717 0.133019 0.133019
2 0.731994 0.005376 0.000029 0.000029
3 0.598658 0.561088 0.314819 0.314819
col4 contains the squared values of col2 calculated using the vectorize function.
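One detail worth knowing: calling a vectorized function on a Series typically returns a NumPy array rather than a Series, and pandas assigns it back to the column by position. The sketch below illustrates this on a small hypothetical Series. It also passes the otypes argument of np.vectorize to fix the output type, so that NumPy does not have to infer it by calling the function on the first element.

```python
import numpy as np
import pandas as pd

def square(a):
    return a ** 2

# Hypothetical small Series, for illustration only
s = pd.Series([0.5, 1.5, 2.0])

# otypes declares the output dtype up front
vectorized_sq = np.vectorize(square, otypes=[float])

result = vectorized_sq(s)
print(type(result).__name__, result.dtype)
print(result)
```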
pandas vectorization
Now, use a pandas vectorized operation to calculate the square of col2.
# import package
import pandas as pd
# square col2 with a vectorized operation
df['col5'] = df['col2'] ** 2
df.head(4)
col1 col2 col3 col4 col5
0 0.374540 0.595156 0.354210 0.354210 0.354210
1 0.950714 0.364717 0.133019 0.133019 0.133019
2 0.731994 0.005376 0.000029 0.000029 0.000029
3 0.598658 0.561088 0.314819 0.314819 0.314819
col5 contains the squared values of col2 calculated using pandas vectorization.
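Since the three approaches are meant to compute the same values, a quick sanity check can confirm that they agree. The following is a minimal sketch on a smaller hypothetical DataFrame (the name check_df and the size of 1,000 rows are assumptions for illustration), comparing the results with np.allclose.

```python
import numpy as np
import pandas as pd

# Smaller hypothetical DataFrame, for illustration only
rng = np.random.default_rng(42)
check_df = pd.DataFrame({"col2": rng.random(1000)})

by_apply = check_df["col2"].apply(lambda x: x ** 2)
by_np_vectorize = np.vectorize(lambda a: a ** 2)(check_df["col2"])
by_pandas = check_df["col2"] ** 2

# Compare the apply and np.vectorize results against the vectorized result
all_match = np.allclose(by_apply, by_pandas) and np.allclose(by_np_vectorize, by_pandas)
print(all_match)
```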
Performance comparison
We will compare the performance of pandas apply, NumPy vectorize, and pandas vectorization.
Calculate performance for pandas apply:
import time
start_time = time.time()
df['col3'] = df['col2'].apply(lambda x: x ** 2)
print(f"time for pandas apply: {time.time() - start_time} seconds")
# output
time for pandas apply: 0.3888976573944092 seconds
The pandas apply function takes about 0.39 s to calculate the square of all values in the column.
Calculate performance for NumPy vectorize:
start_time = time.time()
vectorized_sq = np.vectorize(square)
df['col4'] = vectorized_sq(df['col2'])
print(f"time for NumPy vectorize: {time.time() - start_time} seconds")
# output
time for NumPy vectorize: 0.26853299140930176 seconds
The NumPy vectorize function takes about 0.27 s to calculate the square of all values in the column. In this run, that is roughly 1.4 times faster than the pandas apply function.
Calculate performance for pandas vectorization:
start_time = time.time()
df['col5'] = df['col2'] ** 2
print(f"time for pandas vectorization: {time.time() - start_time} seconds")
# output
time for pandas vectorization: 0.010679006576538086 seconds
pandas vectorization takes about 0.01 s to calculate the square of all values in the column. In this run, that is roughly 25 times faster than NumPy vectorize and roughly 36 times faster than pandas apply.
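A single time.time() measurement can vary noticeably between runs. As a hedged alternative, the sketch below uses the standard-library timeit module to repeat each measurement and keep the best time. The row count (100,000), the number of repeats, and the helper name best_time are assumptions chosen to keep the example quick; absolute timings will depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller hypothetical DataFrame so the benchmark runs quickly
rng = np.random.default_rng(42)
bench_df = pd.DataFrame({"col2": rng.random(100_000)})

def square(a):
    return a ** 2

vectorized_sq = np.vectorize(square)

def best_time(func, repeat=3):
    # Run func once per repeat and keep the fastest run
    return min(timeit.repeat(func, number=1, repeat=repeat))

timings = {
    "pandas apply": best_time(lambda: bench_df["col2"].apply(square)),
    "NumPy vectorize": best_time(lambda: vectorized_sq(bench_df["col2"])),
    "pandas vectorization": best_time(lambda: bench_df["col2"] ** 2),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} seconds")
```

Taking the minimum over several repeats reduces the influence of background activity on the machine, which makes the relative ordering of the three approaches easier to trust.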
In summary, the NumPy vectorize function performs better than the pandas apply function, but pandas vectorization outperforms both.
Hence, it is essential to choose the right approach for pandas data manipulation, especially for large datasets.