Performance of pandas apply and NumPy vectorize

Both pandas apply and NumPy vectorize can be used to manipulate a pandas DataFrame, but they have different uses and performance characteristics.

The pandas apply function applies a built-in or custom function along an axis of a DataFrame, or to each element of a Series.

The pandas apply function is very flexible and can handle complex manipulations, such as calculations with conditional logic. However, it calls the supplied function once per row, column, or element, so it can be slow on large DataFrames.
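As a minimal sketch of conditional logic with apply, the example below uses a small hypothetical DataFrame (`demo`) and a hypothetical helper (`larger_or_sum`); neither is part of the examples that follow.

```python
import pandas as pd

# Small hypothetical DataFrame, used only for this illustration
demo = pd.DataFrame({"a": [1.0, 5.0, 3.0], "b": [4.0, 2.0, 3.0]})


def larger_or_sum(row):
    # Conditional logic evaluated once per row
    if row["a"] > row["b"]:
        return row["a"]
    return row["a"] + row["b"]


# axis=1 passes each row to the function as a Series
demo["result"] = demo.apply(larger_or_sum, axis=1)
print(demo)
```

Because the function runs once per row, this pattern is convenient but becomes slow as the number of rows grows.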

NumPy vectorize can also be used on a pandas DataFrame, but it was developed for NumPy arrays with uniform data types.

Tip
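NumPy vectorize performs better on NumPy arrays, but it may not perform as well as pandas built-in vectorized operations. The NumPy documentation notes that vectorize is provided primarily for convenience, not for performance, and that its implementation is essentially a for loop.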
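To illustrate the difference, the sketch below applies the same condition in two ways: through np.vectorize, which calls a scalar function per element, and through np.where, which works on the whole array at once. The sample values and the `clip_low` helper are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.7, 0.5, 0.9])  # hypothetical sample values


def clip_low(x):
    # Scalar function with conditional logic
    return x if x > 0.5 else 0.0


# np.vectorize calls clip_low once per element
looped = np.vectorize(clip_low)(s)

# np.where evaluates the same condition on the whole array at once
whole_array = np.where(s > 0.5, s, 0.0)

print(looped)
print(whole_array)
```

Both approaches should give the same values here; the whole-array form is usually the faster one on large inputs.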

The following examples demonstrate how to use pandas apply and NumPy vectorize functions along with their performance for pandas DataFrame manipulations.

Create DataFrame

Create a pandas DataFrame with two columns of random values:

# import packages
import pandas as pd
import numpy as np

# set random seed for reproducibility
np.random.seed(42)

df = pd.DataFrame({
    'col1': np.random.rand(1000000),
    'col2': np.random.rand(1000000)
})

# view first few rows of DataFrame
df.head(4)
       col1      col2
0  0.374540  0.595156
1  0.950714  0.364717
2  0.731994  0.005376
3  0.598658  0.561088

We will use this DataFrame for performing manipulations using pandas apply, NumPy vectorize, and pandas vectorization.

pandas apply

We will use the pandas apply function to calculate the square of col2:

df['col3'] = df['col2'].apply(lambda x: x ** 2)

# view DataFrame
df.head(4)
       col1      col2      col3
0  0.374540  0.595156  0.354210
1  0.950714  0.364717  0.133019
2  0.731994  0.005376  0.000029
3  0.598658  0.561088  0.314819

The col3 contains the square values of col2.

NumPy vectorize

Now, use the NumPy vectorize function to calculate the square of col2.

Here, we create a custom function that calculates the square and pass it to the vectorize function.

# import package
import numpy as np

# create a function
def square(a):
    return a ** 2

# NumPy vectorize
vectorized_sq = np.vectorize(square)
df['col4'] = vectorized_sq(df['col2'])

# view DataFrame
df.head(4)
       col1      col2      col3      col4
0  0.374540  0.595156  0.354210  0.354210
1  0.950714  0.364717  0.133019  0.133019
2  0.731994  0.005376  0.000029  0.000029
3  0.598658  0.561088  0.314819  0.314819

col4 contains the squared values of col2 calculated using the vectorize function.
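A possible refinement, not part of the original example: np.vectorize accepts an otypes argument that fixes the output data type. According to the NumPy documentation, when otypes is omitted the output type is determined by calling the function on the first element of the input. The sketch below uses a hypothetical integer array (`sample`) to show the effect.

```python
import numpy as np


def square(a):
    return a ** 2


# otypes=[float] fixes the output dtype instead of inferring it
vectorized_sq = np.vectorize(square, otypes=[float])

sample = np.array([1, 2, 3])  # hypothetical integer input
squared = vectorized_sq(sample)
print(squared, squared.dtype)
```

Setting otypes can help when the function may return different types for different elements.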

pandas vectorization

Now, use pandas vectorization to calculate the square of col2. This is not a separate function: the arithmetic operator is applied to the whole column at once.

# import package
import pandas as pd

# square the whole column with a vectorized operation
df['col5'] = df['col2'] ** 2

# view DataFrame
df.head(4)
       col1      col2      col3      col4      col5
0  0.374540  0.595156  0.354210  0.354210  0.354210
1  0.950714  0.364717  0.133019  0.133019  0.133019
2  0.731994  0.005376  0.000029  0.000029  0.000029
3  0.598658  0.561088  0.314819  0.314819  0.314819

col5 contains the squared values of col2 calculated using pandas vectorization.
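The three approaches are expected to give the same values. As a quick sanity check, the sketch below compares them with np.allclose on a smaller, hypothetical column (`check`) so that it runs quickly on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
# Smaller hypothetical sample, used only for this check
check = pd.DataFrame({"col2": np.random.rand(1000)})

by_apply = check["col2"].apply(lambda x: x ** 2)
by_np_vectorize = np.vectorize(lambda a: a ** 2)(check["col2"])
by_pandas = check["col2"] ** 2

# Compare each result against the pandas vectorized result
apply_matches = np.allclose(by_apply, by_pandas)
vectorize_matches = np.allclose(by_np_vectorize, by_pandas)
print(apply_matches, vectorize_matches)
```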

Performance comparison

We will compare the performance of pandas apply, NumPy vectorize, and pandas vectorization.

Calculate performance for pandas apply:

import time

start_time = time.time()
df['col3'] = df['col2'].apply(lambda x: x ** 2)
print(f"time for pandas apply: {time.time() - start_time} seconds")

# output
time for pandas apply: 0.3888976573944092 seconds

The pandas apply function took about 0.39 s to calculate the square of all values in the column.

Calculate performance for NumPy vectorize:

start_time = time.time()
vectorized_sq = np.vectorize(square)
df['col4'] = vectorized_sq(df['col2'])
print(f"time for NumPy vectorize: {time.time() - start_time} seconds")

# output
time for NumPy vectorize: 0.26853299140930176 seconds

The NumPy vectorize function took about 0.27 s to calculate the square of all values in the column. In this run, that is roughly 1.4 times faster than the pandas apply function.

Calculate performance for pandas vectorization:

start_time = time.time()
df['col5'] = df['col2'] ** 2
print(f"time for pandas vectorization: {time.time() - start_time} seconds")

# output
time for pandas vectorization: 0.010679006576538086 seconds

Pandas vectorization took about 0.01 s to calculate the square of all values in the column. In this run, that is roughly 25 times faster than NumPy vectorize and 36 times faster than pandas apply.
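Single time.time() measurements can vary from run to run. As a more repeatable alternative, the sketch below uses the standard-library timeit module to run each approach several times and keep the best result. The smaller column size and the repeat counts are arbitrary assumptions, and the timings will differ from the numbers reported above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(42)
# Smaller hypothetical column so the benchmark finishes quickly
s = pd.Series(np.random.rand(100_000))


def square(a):
    return a ** 2


vectorized_sq = np.vectorize(square)

candidates = {
    "pandas apply": lambda: s.apply(square),
    "NumPy vectorize": lambda: vectorized_sq(s),
    "pandas vectorization": lambda: s ** 2,
}

results = {}
for name, func in candidates.items():
    # 3 runs of 5 calls each; keep the best run and report time per call
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    results[name] = best
    print(f"{name}: {best:.6f} seconds per call")
```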

In summary, NumPy vectorize was somewhat faster than pandas apply in this test, but pandas vectorization outperformed both by a wide margin.

Hence, it is essential to choose the right approach for pandas data manipulation, especially for large datasets.