As Data science practitioners we always deal with large datasets and often we need to modify one or multiple columns. Using a loop for that kind of task is slow.
In this blog, I will take you through a few alternative approaches which are faster than loops in python.
Table of Contents
1. Filter
Based on the name we can easily guess what it does. It filters iterable objects for us. We are going to pass the filtering conditions in form of a function and this function is going to use to filter each element in the iterable object.
Syntax
filter(function, iterable)
Now let’s compare the python filter’s performance compare to for loop and while loop.
Here, I’m creating a list of 100000 sequential items and then will do a factorial of even numbers. In the end, we will sum these factorial values.
Note: I'm running all these codes in google colab; depending on your system these results could differ.
# Using for loop import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def factorial(n): return math.factorial(n) test_list = list(range(1, 100000)) result = [] for i in range(len(test_list)) : if test_list[i] % 2 == 0: result.append(test_list[i]) print(sum(result)) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) # Output 2499950000 0.041538 secs 1.082287 MByte
# Using while loop import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def factorial(n): return math.factorial(n) test_list = list(range(1, 100000)) result = [] i=0 while i < len(test_list) : if test_list[i] % 2 == 0: result.append(test_list[i]) i = i + 1 print(sum(result)) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) # Output 2499950000 0.047446 secs 1.085056 MByte
# Using Filter import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def factorial(n): return math.factorial(n) test_list = list(range(1, 100000)) result = filter(lambda x: x % 2 == 0, test_list) print(sum((result))) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) # Output 2499950000 0.022666 secs 1.087833 MByte
2. MAP
This function is really useful if you want to apply a function to each value of an iterable object like a list, tuple, or even a pandas series.
Syntax:
map(function, iterable)
Let’s test it with a for a loop.
# Using for loop import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def factorial(n): return math.factorial(n) test_list = list(range(1, 10000)) result = 0 for i in range(len(test_list)) : test_list[i] = factorial(test_list[i]) print(sum(test_list)) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) #output 284654436382457541 .....................0420940313 12.229430 secs 0.167130 MByte
What about while loop
# Using while loop import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def factorial(n): return math.factorial(n) test_list = list(range(1, 10000)) result = 0 i=0 while i < len(test_list) : test_list[i] = factorial(test_list[i]) i = i + 1 print(sum(test_list)) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) #output 284654436382457541 .....................0420940313 11.263874 secs 1.013439 MByte
Now if we run the same code using the map function
# Using Map import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def factorial(n): return math.factorial(n) test_list = list(range(1, 10000)) result = map(factorial, test_list) print(sum((result))) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 #Output 284654436382457541 .....................0420940313 10.069755 secs 1.013439 MByte
3. Reduce
Python offers a function called reduce() that allows you to reduce a list in a more concise way. this function performs functional computation by taking a function and iterable like a list, tuple, series, etc as arguments and returns a single value as output.
Syntax
reduce(func, iterable)
The reduce() function applies the function of two arguments cumulatively to the items of the list, from left to right to reduce the list into a single value.
Unlike the map() and filter() functions, the reduce() is not a built-in function in Python. The reduce() function belongs to the functools module.
Steps of how to reduce function work in python
- The function passed as an argument is applied to the first two elements of the iterable.
- After this, the function is applied to the previously generated result and the next element in the iterable.
- This process continues until all of the iterable items are iterated.
- A single value is returned as a result of applying the reduce function on the iterable.
Let’s understand the steps with an illustration:
If we have a list of numbers [1,2,3,4,5] that is reduced by applying the addition function.
As a final output will be the sum of all the numbers of the list 15.
Now, let’s compare the python function reduce() performance compare to the for loop and while loop.
For this comparison, I’m going to use an addition() function that adds two input numbers.
def addition(x,y):
return x + y
Next, in order to get the sum of all numbers in a list I will apply the addition function with for and while loops. Finally will apply this addition function as an argument to the reduce function.
Testing with for loop
# Using for loop import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def addition(x,y): return x + y test_list = list(range(1, 1000000)) result = 0 for i in range(len(test_list)) : result = addition(result, test_list[i]) print(result) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) # Output 499999500000 0.289569 secs 0.167500 MByte
Testing with a while loop
# Using while loop import math import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def addition(x,y): return x + y test_list = list(range(1, 1000000)) result = 0 i=0 while i < len(test_list) : result = addition(result, test_list[i]) i = i + 1 print(result) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) # Output 499999500000 0.429485 secs 0.167500 MByte
Testing with a reduce function
# Using Reduce from functools import reduce import time import resource import sys time_start = time.perf_counter() sys.set_int_max_str_digits(0) def addition(x,y): return x + y test_list = list(range(1, 1000000)) result = reduce(addition, test_list) print(result) time_elapsed = (time.perf_counter() - time_start) memMb=resource.getrusage(resource.RUSAGE_SELF).ru_maxrss/1024.0/1024.0 print ("%f secs %f MByte" % (time_elapsed,memMb)) # Output 499999500000 0.120981 secs 0.167500 MByte
Now, the difference is only in milliseconds. But if we have a really large dataset then these milliseconds will convert into secs and hours.
Conclusion
Thanks for reading! Tried my best to explain how different approaches can perform better, but it will always depend on you. What’s more convenient for you? If you have any feedback, please share it in the comment section, I will be happy to know.
If you like this article and wish to connect with me follow me on Linkedin.
Be curious, keep learning and stay willing to learn new things until we meet next time!
You may also like:
very lucid explanation.Examples are to the point.
Thanks a lot for your feedback. 🙂