Tutorial: Using Futures in Python

Native futures were introduced in Python 3. Most Python programmers who have never done any sort of asynchronous programming will be unfamiliar with them.

What exactly is a Python Future?

A future is a computational construct, available in Python 3 through the concurrent.futures module. A future provides an interface to represent an operation which, at the moment it is created, might not hold any particular value yet, but is expected to do so at some point in the future.

A working example

Consider you are given a list of URLs against which you need to make a GET request, something like this:

url_list = ['https://docs.python.org/3/library/concurrent.futures.html',
            'https://atomic-temporary-107603627.wpcomstaging.com',
            'http://home.pipeline.com/~hbaker1/Futures.html']

Now, if you do not know anything about futures, you have three options for going about this:

  1. Write a simple for loop which makes a request to each of these URLs sequentially. This is the simplest approach, but the loop will block on every HTTP call.
  2. Write your own threading code and invoke it. While you can achieve concurrency this way, you will still have to write a lot of code to do it.
  3. Use multiprocessing. This option is plagued by the same problem as the second one: you will have to write a lot of code, and additionally there are strict limitations on what you can and cannot pass between processes. So while it might work for the case above, you are bound to run into problems later when you are dealing with complex objects.
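To make option 2 concrete, here is a minimal sketch of what hand-rolled threading looks like. The worker here is a stand-in that pretends every fetch succeeds (a real version would call requests.get), so the sketch runs without network access:

```python
import threading

def fetch(url, results, lock):
    # Stand-in for requests.get(url); a real worker would do the HTTP call here.
    status = 200  # pretend every fetch succeeds
    with lock:  # guard the shared results list explicitly
        results.append((url, status))

def threaded_get(urls):
    results, lock, threads = [], threading.Lock(), []
    for url in urls:
        t = threading.Thread(target=fetch, args=(url, results, lock))
        t.start()
        threads.append(t)
    for t in threads:  # block until every thread finishes
        t.join()
    return results
```

Even this minimal version needs explicit thread bookkeeping and locking, which is exactly the boilerplate ThreadPoolExecutor takes off your hands.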

Sample Code (without futures)

You could end up writing something like this:

import requests
# need to get all the urls
url_list = ['https://docs.python.org/3/library/concurrent.futures.html',
'https://atomic-temporary-107603627.wpcomstaging.com',
'http://home.pipeline.com/~hbaker1/Futures.html']


for url in url_list:
    response = requests.get(url)
    print(response, response.status_code)

Every subsequent call after the first will wait until the one before it finishes. This wastes time as well as processing power, and if one URL fetch raises an exception, the loop aborts and everything after it never runs (you could handle this with a relevant try-except block, but ideally I'd want to write something that doesn't require that either).
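If you do stay with the sequential loop, one failing URL should at least not abort the rest. A hedged sketch of that error handling (requests is assumed to be installed; the fetch_all name is mine, not from the original code):

```python
import requests

def fetch_all(urls):
    results = []
    for url in urls:
        try:
            response = requests.get(url, timeout=5)
            results.append((url, response.status_code))
        except requests.RequestException as exc:
            # Record the failure and keep going instead of crashing the loop.
            results.append((url, exc))
    return results
```

This fixes the abort-on-failure problem but not the blocking one: each request still waits for the previous one.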

Sample Code (with futures)

We can do the same thing with futures as follows:

import requests
from concurrent.futures import ThreadPoolExecutor
url_list = ['https://docs.python.org/3/library/concurrent.futures.html',
            'https://atomic-temporary-107603627.wpcomstaging.com',
            'http://home.pipeline.com/~hbaker1/Futures.html']

def get_futures_get(urls):
    results = []
    currs = ThreadPoolExecutor(max_workers=5)
    for url in urls:
        curr_future_result = currs.submit(worker_func, url, results)

        # You could also call curr_future_result.result() to get the
        # result from the future object. That is a blocking call,
        # though, so it should normally be made only after all the
        # work has been submitted. For this example I instead update
        # the results list that I passed along to the worker function.

    currs.shutdown(wait=True)
    return results

def worker_func(url, result_list):
    response = requests.get(url)

    # Note: appending to a shared list like this is done only for
    # demonstration purposes. Do not write your production code
    # like this.
    result_list.append([response, response.status_code])
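A cleaner alternative to the shared list is to keep the Future objects around and call .result() on each one after all the work has been submitted. This is a sketch of that pattern, with a trivial stand-in worker in place of requests.get so it runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def worker(url):
    # Stand-in for requests.get(url); returns a value instead of
    # appending to a shared list.
    return (url, 200)

def get_futures_results(urls):
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(worker, url) for url in urls]
    # .result() blocks until that future is done and re-raises any
    # exception the worker hit, so errors are not silently lost.
    return [f.result() for f in futures]
```

Because the results come back through the futures themselves, no thread ever touches shared state, and the output order matches the input order.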

The following line

currs.shutdown(wait=True)

makes sure that the ThreadPoolExecutor does not shut down until all the submitted threads/futures have finished.
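The same wait-for-everything behaviour is what the executor's context-manager form gives you: exiting the with block calls shutdown(wait=True) for you. The two spellings below are equivalent (the double function is just a placeholder task):

```python
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return 2 * x

# Explicit shutdown:
pool = ThreadPoolExecutor(max_workers=2)
future = pool.submit(double, 21)
pool.shutdown(wait=True)  # blocks until all submitted tasks finish

# Context-manager form: __exit__ calls shutdown(wait=True) for you.
with ThreadPoolExecutor(max_workers=2) as pool:
    future2 = pool.submit(double, 21)

print(future.result(), future2.result())  # both futures are done here
```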

Dissecting the code(only futures) line by line

currs = ThreadPoolExecutor(max_workers=5)

This initializes a pool of threads which can at any point contain a maximum of 5 threads. Whenever a task is submitted to the thread pool executor, it spins up a new thread if no existing thread is idle and the number of running threads is less than max_workers.

for url in urls: 
    curr_future_result = currs.submit(worker_func, url, results)

Here we iterate over the list of URLs and submit each one to be processed by the worker function we have already written. Notice how we pass the function reference and its parameters separately.

Each call to submit returns a Future object, which can be checked for a result.
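A Future can be polled or waited on once you hold on to it. A small sketch of that inspection API, using as_completed to consume results as they arrive (slow_square is a stand-in worker, no network needed):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_square(n):
    time.sleep(0.1)  # simulate IO latency
    return n * n

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(slow_square, n): n for n in (1, 2, 3)}
    # as_completed yields each future as soon as it finishes,
    # regardless of submission order.
    for fut in as_completed(futures):
        print(futures[fut], '->', fut.result())
    # Every future reports done() once its result has been delivered.
    assert all(f.done() for f in futures)
```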

currs.shutdown(wait=True)

This line tells the thread pool to shut down: you will not be able to submit any more tasks to this pool, and with wait=True the call blocks until all pending tasks have completed.

Performance Comparison

The code above gives a rough idea of how to use futures effectively. However, I still didn't know what performance boost futures offer when I am doing a lot of IO.

Hence I wrote a small script testing the performance of both implementations:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

url_list = ['https://docs.python.org/3/library/concurrent.futures.html',
            'https://atomic-temporary-107603627.wpcomstaging.com',
            'http://home.pipeline.com/~hbaker1/Futures.html']

def get_sequential_get(urls):
    """
    Function which does a get requests
    one after the other
    """
    result = []
    for url in urls:
        response = requests.get(url)
        result.append([response, response.status_code])
    return result


def get_futures_get(urls):
    results = []
    currs = ThreadPoolExecutor(max_workers=5)
    for url in urls:
        currs.submit(worker_func, url, results)
    currs.shutdown(wait=True)
    return results

def worker_func(url, result_list):
    response = requests.get(url)
    result_list.append([response, response.status_code])

def calculate_function_time(curr_func, **kwargs):
    # time.monotonic() avoids the wrap-around bug you would get from
    # subtracting time.localtime().tm_sec values across a minute boundary.
    start = time.monotonic()
    curr_func(**kwargs)
    end = time.monotonic()
    return round(end - start)

if __name__ == '__main__':
    print('Time taken by normal implementation {} seconds'.format(calculate_function_time(get_sequential_get, **{'urls': url_list})))
    print('Time taken by futures implementation {} seconds'.format(calculate_function_time(get_futures_get, **{'urls': url_list})))

On running the above script we get a performance difference of about 6x.

Time taken by normal implementation 6 seconds
Time taken by futures implementation 1 seconds

However, over the course of 1000 iterations it goes down to about 3x (which is still a great performance boost).

When not to use futures (or threading in general) in Python

The most important caveat when using futures in Python is to understand that futures (and threading) can only give you a performance boost when there is IO on which the program would otherwise sit blocked.

Using futures or threading for computation-heavy or CPU-bound tasks is not recommended: it will not yield any noticeable performance gain because of Python's infamous Global Interpreter Lock (GIL).
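For CPU-bound work, the standard-library answer is ProcessPoolExecutor, which sidesteps the GIL by running workers in separate processes while keeping the same executor interface. A sketch, assuming the work function and its arguments are picklable (cpu_heavy is an illustrative name):

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    # A pure-Python loop: threads would serialize on the GIL here,
    # but separate processes each get their own interpreter.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as pool:
        totals = list(pool.map(cpu_heavy, [10_000, 20_000]))
    print(totals)
```

The __main__ guard matters here: on platforms that spawn worker processes, the module is re-imported in each child, and the guard keeps the pool from being created recursively.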

Conclusions and Further Work

In a follow-up, I will be doing an in-depth review of the futures implementation in Python 3.

This will allow us to use futures more effectively than we do now. Additionally, in-depth knowledge of an implementation helps avoid hidden performance bottlenecks.

In the meantime, tell me if you use Python futures in production and what your experience with them has been!
