
Introduction to Python Generators

Generator Functions

To create a generator, define a function as you normally would but use the yield statement instead of return. The presence of yield tells the interpreter to treat the function as a generator function, which returns an iterator when called:
def countdown(num):
    print('Starting')
    while num > 0:
        yield num
        num -= 1
The yield statement pauses the function and saves the local state so that it can be resumed right where it left off.
What happens when you call this function?
>>> def countdown(num):
...     print('Starting')
...     while num > 0:
...         yield num
...         num -= 1
...
>>> val = countdown(5)
>>> val
<generator object countdown at 0x10213aee8>
Calling the function does not execute it. We know this because the string Starting did not print. Instead, the function returns a generator object which is used to control execution.
Generator objects execute when next() is called:
>>> next(val)
Starting
5
The first call to next() starts execution at the top of the function body and runs until the first yield statement, where the value to the right of yield is returned to the caller. Each subsequent call to next() resumes right after that yield and runs until a yield is reached again; here that means finishing the current pass through the while loop and starting the next one. If the function ends without reaching another yield (in our case, because num <= 0 and the while loop exits), a StopIteration exception is raised:
>>> next(val)
4
>>> next(val)
3
>>> next(val)
2
>>> next(val)
1
>>> next(val)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
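In practice you rarely call next() by hand. A for loop calls it for you and stops cleanly when StopIteration is raised. A minimal sketch, reusing countdown from above (the print call is left out here to keep the output short):

```python
def countdown(num):
    while num > 0:
        yield num
        num -= 1

# A for loop calls next() behind the scenes and handles StopIteration.
collected = []
for value in countdown(3):
    collected.append(value)
print(collected)  # [3, 2, 1]

# The same thing by hand, catching the exception explicitly.
gen = countdown(2)
manual = []
while True:
    try:
        manual.append(next(gen))
    except StopIteration:
        break
print(manual)  # [2, 1]

# next() also accepts a default that is returned instead of raising.
print(next(gen, 'done'))  # done
```

Once a generator is exhausted it stays exhausted; to iterate again, call the generator function again to get a fresh generator object.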

Generator Expressions

Generators can also be written inline, using a syntax similar to that of list comprehensions; the result is a generator object rather than a list:
>>> my_list = ['a', 'b', 'c', 'd']
>>> gen_obj = (x for x in my_list)
>>> for val in gen_obj:
...     print(val)
...
a
b
c
d
Take note of the parentheses around the expression on the second line. They denote a generator expression, which, for the most part, does the same thing that a list comprehension does, but does it lazily:
>>> import sys
>>> g = (i * 2 for i in range(10000) if i % 3 == 0 or i % 5 == 0)
>>> print(sys.getsizeof(g))
72
>>> l = [i * 2 for i in range(10000) if i % 3 == 0 or i % 5 == 0]
>>> print(sys.getsizeof(l))
38216
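The exact byte counts above depend on the Python version and platform, so treat them as indicative. What should hold in general is that the size of a generator object does not depend on how many items it will produce, while a list grows with its contents. A small sketch to check this on your own interpreter (the variable names are only illustrative):

```python
import sys

# Two generator expressions of the same shape over very different ranges.
small_gen = (i * 2 for i in range(10))
large_gen = (i * 2 for i in range(10_000_000))

# Neither has produced a value yet, so both are just small objects.
print(sys.getsizeof(small_gen) == sys.getsizeof(large_gen))  # True

# Lists store every element, so the bigger one takes more memory.
small_list = [i * 2 for i in range(10)]
large_list = [i * 2 for i in range(1000)]
print(sys.getsizeof(small_list) < sys.getsizeof(large_list))  # True
```

Note that sys.getsizeof reports only the size of the container object itself, not the objects it refers to.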
Be careful not to mix up the syntax of a list comprehension with that of a generator expression ([] vs ()), since generator expressions can run slower than list comprehensions (unless you run out of memory, of course):
>>> import cProfile
>>> cProfile.run('sum((i * 2 for i in range(10000000) if i % 3 == 0 or i % 5 == 0))')
         4666672 function calls in 3.531 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  4666668    2.936    0.000    2.936    0.000 <string>:1(<genexpr>)
        1    0.001    0.001    3.529    3.529 <string>:1(<module>)
        1    0.002    0.002    3.531    3.531 {built-in method exec}
        1    0.592    0.592    3.528    3.528 {built-in method sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


>>> cProfile.run('sum([i * 2 for i in range(10000000) if i % 3 == 0 or i % 5 == 0])')
         5 function calls in 3.054 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.725    2.725    2.725    2.725 <string>:1(<listcomp>)
        1    0.078    0.078    3.054    3.054 <string>:1(<module>)
        1    0.000    0.000    3.054    3.054 {built-in method exec}
        1    0.251    0.251    0.251    0.251 {built-in method sum}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
Mixing the two up is particularly easy (even for senior developers) in the above example, since both produce exactly the same result in the end.
NOTE: Keep in mind that generator expressions can be drastically faster when your data is larger than the available memory, because the list version has to hold every element at once.

Use Cases

Generators are perfect for reading a large number of large files since they yield out data a single chunk at a time irrespective of the size of the input stream. They can also result in cleaner code by decoupling the iteration process into smaller components.
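As a rough illustration of reading data one chunk at a time, here is a minimal sketch. The helper name read_in_chunks and the chunk size are assumptions made for this example, and an in-memory io.StringIO stands in for a real file so the snippet runs anywhere:

```python
import io


def read_in_chunks(file_obj, chunk_size=1024):
    """Yield successive chunks of at most chunk_size characters."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            # read() returns an empty string at end of file
            break
        yield chunk


# io.StringIO behaves like an open text file for this purpose.
fake_file = io.StringIO('abcdefghij')
chunks = list(read_in_chunks(fake_file, chunk_size=4))
print(chunks)  # ['abcd', 'efgh', 'ij']
```

Only one chunk is held in memory at any moment, regardless of how large the underlying file is. The list() call here is just for display; a real consumer would loop over the generator instead.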

Example 1

import os


def emit_lines(pattern):
    lines = []
    for dir_path, dir_names, file_names in os.walk('test/'):
        for file_name in file_names:
            if file_name.endswith('.py'):
                # close each file as soon as it has been scanned
                with open(os.path.join(dir_path, file_name)) as f:
                    for line in f:
                        if pattern in line:
                            lines.append(line)
    return lines
This function loops through a set of files in the specified directory. It opens each file and then loops through each line to test for the pattern match.
This works fine with a small number of small files. But, what if we’re dealing with extremely large files? And what if there are a lot of them? Fortunately, Python’s open() function is efficient and doesn’t load the entire file into memory. But what if our matches list far exceeds the available memory on our machine?
So, instead of running out of space (large lists) or time (a nearly infinite data stream) when processing large amounts of data, generators are the ideal tool, as they yield data one item at a time (instead of creating intermediate lists).
Let’s look at the generator version of the above problem and see why generators suit such use cases when arranged as a processing pipeline.
We divide the whole process into three components:
  • Generating the set of filenames
  • Generating all lines from all files
  • Filtering out lines on the basis of pattern matching
import os


def generate_filenames():
    """
    generates a sequence of paths to files
    matching a specific extension
    """
    for dir_path, dir_names, file_names in os.walk('test/'):
        for file_name in file_names:
            if file_name.endswith('.py'):
                yield os.path.join(dir_path, file_name)


def cat_files(file_names):
    """
    takes in an iterable of filenames
    and yields every line of every file
    """
    for file_name in file_names:
        # each file is closed once its lines are exhausted
        with open(file_name) as f:
            for line in f:
                yield line


def grep_files(lines, pattern):
    """
    takes in an iterable of lines
    and yields those containing pattern
    """
    for line in lines:
        if pattern in line:
            yield line


py_files = generate_filenames()
py_lines = cat_files(py_files)
lines = grep_files(py_lines, 'python')
for line in lines:
    print(line, end='')  # lines already end with a newline
In the above snippet, we do not build any intermediate list of lines. Instead, we create a pipeline whose components feed each other one item at a time through iteration. grep_files takes in a generator object of all the lines of the *.py files. Similarly, cat_files takes in a generator object of all the filenames in a directory. This is how the whole pipeline is glued together via iteration.
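To make the "one item at a time" behaviour visible, here is a small sketch of a two-stage pipeline over an in-memory list. The names read_lines, grep_lines, and events are hypothetical and exist only for this illustration; events records the order in which the stages actually do their work:

```python
events = []


def read_lines(source):
    """Stage 1: hand out lines one at a time, recording each read."""
    for line in source:
        events.append(('read', line))
        yield line


def grep_lines(lines, pattern):
    """Stage 2: pass on only the lines that contain pattern."""
    for line in lines:
        if pattern in line:
            events.append(('match', line))
            yield line


source = ['import os', 'python rocks', 'x = 1', 'more python']
matches = grep_lines(read_lines(source), 'python')

# Building the pipeline does no work; nothing has been read yet.
print(events)  # []

first = next(matches)
print(first)   # python rocks
print(events)
# [('read', 'import os'), ('read', 'python rocks'), ('match', 'python rocks')]
```

Asking for the first match pulled only two lines through the pipeline; the remaining lines are not touched until the consumer asks for more.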

Example 2

Generators work great for web scraping and crawling recursively:
import re
from urllib.parse import urljoin

import requests


def get_pages(link):
    links_to_visit = [link]
    seen = {link}  # avoid fetching the same URL twice
    http_pattern = re.compile('https?')
    while links_to_visit:
        current_link = links_to_visit.pop(0)
        page = requests.get(current_link)
        # page.text is the decoded body; str(page.content) would be a bytes repr
        for url in re.findall('<a href="([^"]+)">', page.text):
            # resolve relative links against the page they appear on
            url = urljoin(current_link, url)
            if http_pattern.match(url) and url not in seen:
                seen.add(url)
                links_to_visit.append(url)
        yield current_link


webpage = get_pages('http://sample.com')
for result in webpage:
    print(result)
Here, we fetch a single page at a time, and the caller can perform some sort of action on each page as soon as it is yielded. What would this look like without a generator? Either the fetching and processing would have to happen within the same function (resulting in highly coupled code that’s hard to test) or we’d have to fetch all the links before processing a single page.
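Because the crawler is a generator, the consumer decides how far the crawl goes. The sketch below illustrates this without any network access: FAKE_SITE is a hypothetical in-memory link graph standing in for real HTTP requests, and fetched records which pages were actually visited:

```python
from itertools import islice

# Hypothetical site map: each page maps to the links found on it.
FAKE_SITE = {
    '/': ['/a', '/b'],
    '/a': ['/c'],
    '/b': ['/'],
    '/c': [],
}

fetched = []


def crawl(start):
    """Breadth-first walk that yields each page as soon as it is fetched."""
    to_visit = [start]
    seen = {start}
    while to_visit:
        current = to_visit.pop(0)
        fetched.append(current)  # stands in for an HTTP request
        for link in FAKE_SITE.get(current, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
        yield current


# Take only the first two pages; the crawl pauses after that.
first_two = list(islice(crawl('/'), 2))
print(first_two)  # ['/', '/a']
print(fetched)    # ['/', '/a']
```

Only two pages were fetched even though four are reachable, since the generator does no work beyond what the consumer requests.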

Conclusion

Generators allow us to ask for values as and when we need them, making our applications more memory efficient and perfect for infinite streams of data. They can also be used to refactor out the processing from loops resulting in cleaner, decoupled code. If you’d like to see more examples, check out Generator Tricks for Systems Programmers and Iterator Chains as Pythonic Data Processing Pipelines.
