The Data Wrangling Workshop

Basic File Operations in Python

In the previous topic, we investigated a few advanced data structures and also learned neat and useful functional programming methods to manipulate them without side effects. In this topic, we will learn about a few OS-level functions in Python. These include working with files, but they could also include working with printers and even the internet. We will concentrate mainly on file-related functions and learn how to open a file, read the data line by line or all at once, and finally, how to cleanly close the file we opened. Closing a file deserves care, yet it is a step that developers often overlook. When handling file operations, we often run into very strange and hard-to-track-down bugs because a process opened a file and did not close it properly. We will apply a few of the techniques we have learned about to a file that we will read to practice our data wrangling skills further.

Exercise 2.08: File Operations

In this exercise, we will learn about the OS module of Python, and we will also look at two very useful ways to write and read environment variables. The power of writing and reading environment variables is often very important when designing and developing data-wrangling pipelines.

Note

In fact, one of the factors of the famous 12-factor app design is the very idea of storing configuration in the environment. You can check it out at this URL: https://12factor.net/config.

The purpose of the OS module is to give you ways to interact with OS-dependent functionalities. In general, it is pretty low-level and most of its functions are not needed on a day-to-day basis; however, some are worth learning. os.environ is a dictionary-like collection that Python maintains with the environment variables visible to the current process. Assigning to it lets you create new variables or update existing ones. The os.getenv function gives you the ability to read an environment variable:

  1. Import the os module.

    import os

  2. Set an environment variable and read it back:

    os.environ['MY_KEY'] = "MY_VAL"

    os.getenv('MY_KEY')

    The output is as follows:

    'MY_VAL'

  3. Print the environment variable when it is not set:

    print(os.getenv('MY_KEY_NOT_SET'))

    The output is as follows:

    None

  4. Print the os environment:

    print(os.environ)

    Note

    The output has not been added for security reasons.

    To access the source code for this specific section, please refer to https://packt.live/2YCZAnC.

    You can also run this example online at https://packt.live/3fCqnaB.

After executing the preceding code, you will see that reading MY_KEY returns its value, whereas reading MY_KEY_NOT_SET returns None. Using the OS module, you can therefore set and read environment variables from Python. Note that a variable set through os.environ is visible to the current Python process and to any child processes it starts; it does not change the environment of the shell that launched Python.
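
The following is a minimal sketch of two related patterns: supplying a fallback value to os.getenv, and converting a value read from the environment. MY_OPTIONAL_KEY is a made-up variable name used only for illustration, and the sketch assumes you do not rely on a variable with that name.

```python
import os

# MY_OPTIONAL_KEY is a hypothetical name; make sure it is unset for the demo.
os.environ.pop("MY_OPTIONAL_KEY", None)

# The second argument of os.getenv is returned when the variable is missing.
fallback = os.getenv("MY_OPTIONAL_KEY", "default_value")
print(fallback)  # default_value

# Environment variables are always strings, so convert them explicitly.
os.environ["MY_OPTIONAL_KEY"] = "42"
as_number = int(os.getenv("MY_OPTIONAL_KEY"))
print(as_number + 1)  # 43

# Membership tests work as they do with a dictionary.
print("MY_OPTIONAL_KEY" in os.environ)  # True
```

Supplying a default in this way avoids having to check for None every time a pipeline reads an optional setting.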

File Handling

In this section, we will learn about how to open a file in Python. We will learn about the different modes that we can use and what they stand for when opening a file. Python has a built-in open function that we will use to open a file. The open function takes a few arguments as input. Among them, the first one, which stands for the name of the file you want to open, is the only one that's mandatory. Everything else has a default value. When you call open, Python uses underlying system-level calls to open a file handler and return it to the caller.

Usually, a file is opened either for reading or for writing. If we open a file in one of these modes, the other operation is not supported unless + is added to the mode. Whereas reading usually means we start to read from the beginning of an existing file, writing can mean either starting a new file and writing from the beginning or opening an existing file and appending to it.
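
The difference between starting a file afresh and appending to it can be sketched as follows. The file name modes_demo.txt and the temporary directory are assumptions made for this illustration, so that the snippet does not touch any of the workshop datasets.

```python
import os
import tempfile

# A throwaway file in a temporary directory (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

fd = open(path, "w")   # "w" creates the file, or empties it if it exists
fd.write("first\n")
fd.close()

fd = open(path, "w")   # opening with "w" again discards "first"
fd.write("second\n")
fd.close()

fd = open(path, "a")   # "a" keeps the existing content and writes at the end
fd.write("third\n")
fd.close()

fd = open(path)        # no mode given, so the file is opened for reading
content = fd.read()
fd.close()

print(content)
```

Only the lines second and third remain in the file, because the second open call in w mode truncated what was written before it.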

Here is a table showing you all the different modes Python supports for opening a file:

Figure 2.9: Modes to read a file

There is also a deprecated mode, U, which has no effect in Python 3 and was removed in Python 3.11. One thing we must remember here is that Python will always differentiate between t and b modes, even if the underlying OS doesn't. This is because, in b mode, Python does not try to decode what it is reading and gives us back a bytes object instead, whereas, in t mode, it does try to decode the stream and gives us back a string.
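
To see the difference between t and b modes, here is a small sketch that writes one word containing a non-ASCII character and reads it back both ways. The file name and the temporary directory are assumptions made for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "text_vs_binary.txt")

# Write a short string that contains a non-ASCII character.
fd = open(path, "w", encoding="utf8")
fd.write("café")
fd.close()

# Text mode decodes the stored bytes and returns a str.
fd = open(path, "rt", encoding="utf8")
as_text = fd.read()
fd.close()

# Binary mode returns the raw bytes without decoding them.
fd = open(path, "rb")
as_bytes = fd.read()
fd.close()

print(repr(as_text), len(as_text))   # 'café' 4
print(as_bytes, len(as_bytes))       # b'caf\xc3\xa9' 5
```

The text has four characters, but the file holds five bytes, because é takes two bytes in UTF-8.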

You can open a file for reading with the command that follows. The path (highlighted) would need to be changed based on the location of the file on your system.

fd = open("../datasets/data_temporary_files.txt")

We will discuss some more functions in the following section.

Note

The file can be found here https://packt.live/2YGpbfv.

This is opened in rt mode (read and text mode), which is the default. You can open a file in binary mode if you want. To open a file in binary mode, use the rb (read, binary) mode:

fd = open("../datasets/AA.txt", "rb")

fd

The output is as follows:

<_io.BufferedReader name='../datasets/AA.txt'>

Note

The file can be found here: https://packt.live/30OSkaP.

This is how we open a file for writing:

fd = open("../datasets/data_temporary_files.txt", "w")

fd

The output is as follows:

<_io.TextIOWrapper name='../datasets/data_temporary_files.txt' mode='w' encoding='cp1252'>

The encoding shown here is the platform default and may differ on your system.

Let's practice this concept in the following exercise.

Exercise 2.09: Opening and Closing a File

In this exercise, we will learn how to close a file after opening it.

Note

The file we will be working on can be found here: https://packt.live/30OSkaP.

We must close a file once we have opened it. A lot of system-level bugs can occur due to a dangling file handler, that is, a file that the operating system still considers open even though the application is done using it. While the file stays open, buffered writes may not have reached the disk, and other processes may be unable to modify the file. Once we close a file, no further operations can be performed on that file using that specific file handler.

  1. Open a file in binary mode:

    fd = open("../datasets/AA.txt", "rb")

    Note

    Change the highlighted path based on the location of the file on your system. The video of this exercise shows how to use the same function on a different file. There, you'll also get a glimpse of the function used to write to files, which is something you'll learn about later in the chapter.

  2. Close a file using close():

    fd.close()

Python also gives us a closed flag with the file handler. If we print it before closing, then we will see False, whereas if we print it after closing, then we will see True. If our logic checks whether a file is properly closed or not, then this is the flag we want to use.
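
A short sketch of the closed flag in action follows. The file name is an assumption made for this illustration, and the snippet writes to a temporary directory rather than to the workshop datasets.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "closed_flag_demo.txt")

fd = open(path, "w")
fd.write("some data\n")
before = fd.closed      # the file is still open here
fd.close()
after = fd.closed       # the file has been closed
print(before, after)    # False True

# Using the handler after closing it raises a ValueError.
raised = False
try:
    fd.write("more data\n")
except ValueError as err:
    raised = True
    print("Cannot use a closed file:", err)
```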

Note

To access the source code for this specific section, please refer to https://packt.live/30R6FDC.

You can also run this example online at https://packt.live/3edLoI8.

The with Statement

In this section, we will learn about the with statement in Python and how we can effectively use it in the context of opening and closing files.

The with command is a compound statement in Python, like if and for, designed to combine multiple lines. Like any compound statement, with also affects the execution of the code enclosed by it. In the case of with, it is used to wrap a block of code in the scope of what we call a Context Manager in Python. A context manager is a convenient way to work with resources and helps us avoid forgetting to close the resource. A detailed discussion of context managers is out of the scope of this exercise and this topic in general, but it is sufficient to say that the file object returned by open implements the context manager protocol. If we wrap the open call inside a with statement, a close call is therefore guaranteed to be made automatically when control leaves the with block, even if an exception is raised inside it.
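
To make the idea of a context manager a little more concrete, here is a minimal sketch of the protocol that with relies on. ManagedResource is a hypothetical class invented for this illustration; it is not how open is implemented internally.

```python
class ManagedResource:
    """Hypothetical resource that records when it is acquired and released."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        # Called when the with block is entered.
        self.events.append("acquired")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the with block is left, with or without an error.
        self.events.append("released")
        return False  # do not suppress exceptions


resource = ManagedResource()
with resource as r:
    r.events.append("working")

print(resource.events)  # ['acquired', 'working', 'released']
```

A file object behaves in a similar spirit: leaving the with block triggers its cleanup step, which closes the file.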

Note

There is an entire PEP for with at https://www.python.org/dev/peps/pep-0343/. We encourage you to look into it.

Opening a File Using the with Statement

Open a file using the with statement:

with open("../datasets/AA.txt") as fd:

    print(fd.closed)

print(fd.closed)

The output is as follows:

False

True

If we execute the preceding code, we will see that the first print will end up printing False, whereas the second one will print True. This means that as soon as the control goes out of the with block, the file descriptor is automatically closed.
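
The same guarantee holds when something goes wrong inside the block. The sketch below assumes a small temporary file whose second line cannot be converted to a number; the file name and its contents are made up for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "numbers_demo.txt")

with open(path, "w") as fd:
    fd.write("42\nnot a number\n")

total = 0
try:
    with open(path) as fd:
        for line in fd:
            total += int(line)   # fails on the second line
except ValueError:
    print("Bad line encountered")

print(total)       # 42
print(fd.closed)   # True
```

Even though the loop stopped with an error, the file was closed before the except clause ran.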

Note

This is by far the cleanest and most Pythonic way to open a file and obtain a file descriptor for it. We encourage you to use this pattern whenever you need to open a file by yourself.

Exercise 2.10: Reading a File Line by Line

In this exercise, we'll read a file line by line. Let's go through the following steps to do so:

  1. Open a file and then read the file line by line and print it as we read it:

    with open("../datasets/Alice`s Adventures in Wonderland, "\

              "by Lewis Carroll", encoding="utf8") as fd:

        for line in fd:

            print(line)

    Note

    Do not forget to change the path (highlighted) of the file based on its location on your system.

    The output (partially shown) is as follows:

    Figure 2.10: Screenshot from the Jupyter notebook

    Looking at the preceding code, we can see why this pattern is important. Because the loop fetches one line at a time, this short snippet can open and read files that are many gigabytes in size without loading them into memory all at once. There is another explicit method in the file descriptor object, called readline, which reads one line at a time from a file.

  2. Duplicate the same for loop, just after the first one:

    with open("../datasets/Alice`s Adventures in Wonderland, "\

              "by Lewis Carroll", encoding="utf8") as fd:

        for line in fd:

            print(line)

        print("Ended first loop")

        for line in fd:

            print(line)

    Note

    Do not forget to change the path (highlighted) of the file based on its location on your system.

    The output (partially shown) is as follows:

Figure 2.11: Section of the open file

Note

To access the source code for this specific section, please refer to https://packt.live/37B7aTX.

You can also run this example online at https://packt.live/3fCqWBf.
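
When a second loop follows the first one over the same open file, it finds nothing left to read, because the first loop has already moved the read position to the end of the file. The sketch below shows this on a small temporary file (a made-up stand-in for the book text), along with seek(0), which moves the position back to the start, and readline.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "three_lines.txt")

with open(path, "w") as fd:
    fd.write("line 1\nline 2\nline 3\n")

with open(path) as fd:
    first_pass = [line for line in fd]    # reads all three lines
    second_pass = [line for line in fd]   # nothing left to read
    fd.seek(0)                            # go back to the beginning
    third_pass = [line for line in fd]    # reads all three lines again
    fd.seek(0)
    one_line = fd.readline()              # reads only the first line

print(len(first_pass), len(second_pass), len(third_pass))  # 3 0 3
print(repr(one_line))                                      # 'line 1\n'
```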

Let's look at the last exercise of this chapter.

Exercise 2.11: Writing to a File

In this exercise, we'll look into file operations by writing the contents of a dictionary to a file. We will write a few lines to a file and then read the file back.

Note

data_temporary_files.txt can be found at https://packt.live/2YGpbfv.

Let's go through the following steps:

  1. Use the write function from the file descriptor object:

    data_dict = {"India": "Delhi", "France": "Paris",\

                 "UK": "London", "USA": "Washington"}

    with open("../datasets/data_temporary_files.txt", "w") as fd:

        for country, capital in data_dict.items():

            fd.write("The capital of {} is {}\n"\

                     .format(country, capital))

    Note

    Throughout this exercise, don't forget to change the path (highlighted) based on where you have stored the text file.

  2. Read the file using the following command:

    with open("../datasets/data_temporary_files.txt", "r") as fd:

        for line in fd:

            print(line)

    The output is as follows:

    The capital of India is Delhi

    The capital of France is Paris

    The capital of UK is London

    The capital of USA is Washington

  3. Use the print function to write to a file using the following command:

    data_dict_2 = {"China": "Beijing", "Japan": "Tokyo"}

    with open("../datasets/data_temporary_files.txt", "a") as fd:

        for country, capital in data_dict_2.items():

            print("The capital of {} is {}"\

                  .format(country, capital), file=fd)

  4. Read the file using the following command:

    with open("../datasets/data_temporary_files.txt", "r") as fd:

        for line in fd:

            print(line)

    The output is as follows:

    The capital of India is Delhi

    The capital of France is Paris

    The capital of UK is London

    The capital of USA is Washington

    The capital of China is Beijing

    The capital of Japan is Tokyo

    Note

    In the second case, we did not add an extra newline character, \n, at the end of the string to be written. The print function does that automatically for us.

    To access the source code for this specific section, please refer to https://packt.live/2BkVh8j.

    You can also run this example online at https://packt.live/3hB7xT0.
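
Each line read from a file still carries its trailing newline character, and print adds one more, which is why blank lines appear between the lines when they are printed. A minimal sketch of one way to avoid this follows; it uses a temporary file with a made-up name rather than the dataset file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "capitals_demo.txt")

with open(path, "w") as fd:
    fd.write("The capital of India is Delhi\n")          # newline added by hand
    print("The capital of China is Beijing", file=fd)   # newline added by print

with open(path) as fd:
    # Strip the trailing newline so that print does not double it.
    lines = [line.rstrip("\n") for line in fd]

for line in lines:
    print(line)
```

Passing end="" to print while looping over the file would have a similar effect.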

With this, we will end this topic. Just like the previous topics, we have designed an activity for you to practice your newly acquired skills.

Activity 2.02: Designing Your Own CSV Parser

A CSV file is something you will encounter a lot in your life as a data practitioner. It is a text file in which tabular data is stored with the values of each row separated by commas, although other separator characters, such as tab or *, can also be used. Here's an example CSV file:

Figure 2.12: Partial output of a CSV file

In this activity, we will be tasked with building our own CSV reader and parser. Although it is a big task if we try to cover all use cases and edge cases, along with escape characters, for the sake of this short activity, we will keep our requirements small. We will assume that there is no escape character, meaning that if you use a comma at any place in your row, you are starting a new column. We will also assume that the only function we are interested in is to be able to read a CSV file line by line, where each read will generate a new dictionary with the column names as keys and the values of that row as values.

Here is an example:

Figure 2.13: Table with sample data

We can convert the data in the preceding table into a Python dictionary, which would look as follows: {"Name": "Bob", "Age": "24", "Location": "California"}. Complete the following steps:

  1. Import zip_longest from itertools. Create a function to zip header, line, and fillvalue=None.
  2. Open the accompanying sales_record.csv file from the GitHub link (https://packt.live/2Yb6iCh) by using r mode inside a with block and check that it is opened.
  3. Read the first line and use string methods to generate a list of all the column names.
  4. Start reading the file. Read it line by line.
  5. Read each line and pass that line to a function, along with the list of the headers. The work of the function is to construct a dictionary out of these two and fill in the key-value pairs. Keep in mind that a missing value should result in None.

    The partial output of this should look like this:

Figure 2.14: Partial output of the sales_record file

Note

The solution for this activity can be found on page 460.
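
As a starting point, the pairing of headers and values can be sketched as follows. This is not the book's solution: row_to_dict is a hypothetical helper name, and the header and rows are made-up sample data taken from the small table above rather than from sales_record.csv.

```python
from itertools import zip_longest


def row_to_dict(header, line):
    """Pair each column name with the matching value from one CSV line."""
    values = line.rstrip("\n").split(",")
    # zip_longest pads the shorter sequence, so missing values become None.
    return dict(zip_longest(header, values, fillvalue=None))


header = "Name,Age,Location".split(",")

full_row = row_to_dict(header, "Bob,24,California\n")
short_row = row_to_dict(header, "Alice,31\n")

print(full_row)   # {'Name': 'Bob', 'Age': '24', 'Location': 'California'}
print(short_row)  # {'Name': 'Alice', 'Age': '31', 'Location': None}
```

Under these assumptions, a field that is present but empty (as in "Alice,31,") comes back as an empty string rather than None, and a row with more values than headers would produce a None key, so a fuller parser would need to decide how to treat both cases.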

With this, we conclude the chapter.