
Reading and writing data

A program might require input data and might also create data to output. It is therefore useful to be able to read data from, and write data to, a file. This is often referred to as I/O, standing for "Input/Output".

Files can either contain plain ASCII text, i.e., text that is readable if the file is opened with a standard text editor, or information that is in a binary format (generally this is not readable by a standard editor unless the format is known).

Plain text files are useful in that they are human readable, although for the same amount of information they will generally be larger (i.e., take up more space on disk) than an equivalent binary file.

Note

When reading and writing it is useful to know your directory structure and be explicit about where you want to save to/read from. This means giving the full path, including directories (and drive letter on Windows), of the file. For example, you might refer to files with:

filename = "C:/My Documents/myfile.txt"

to make sure you are using the file on the C-drive, in the "My Documents" folder, and with the name myfile.txt. Note that the slashes are in the opposite direction (forward slashes) to the way they are normally shown in Windows. Equivalently you could use backslashes as:

filename = "C:\\My Documents\\myfile.txt"
# or
filename = r"C:\My Documents\myfile.txt"

which both stop Python interpreting backslashes (\) in a string as an escape character for the following letter (e.g., \n in a string means new line).

Another option is to use the pathlib built-in module to construct Path objects that can be used instead of strings, e.g.,:

from pathlib import Path
filename = Path("/My Documents/myfile.txt")

On Windows, a path given without a drive letter is interpreted relative to the current drive, so pathlib will often work with or without the drive supplied if the file is stored locally.
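As a small sketch of the points above, the following compares the two backslash spellings and builds a path with pathlib. The folder name "data" is a hypothetical example, and PureWindowsPath is used only so that the Windows-style path is parsed the same way on any operating system.

```python
from pathlib import Path, PureWindowsPath

# escaped backslashes and a raw string spell the same path
escaped = "C:\\My Documents\\myfile.txt"
raw = r"C:\My Documents\myfile.txt"
same = escaped == raw

# PureWindowsPath splits a Windows-style path into its parts
win = PureWindowsPath(raw)
drive = win.drive  # the drive letter part
name = win.name  # the final file name

# Path objects can be joined with the / operator
# ("data" is a hypothetical folder name used for illustration)
datafile = Path.cwd() / "data" / "myfile.txt"

print(same, drive, name)
print(datafile)
```

Building paths with the / operator avoids having to think about which slash direction the operating system expects.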

It is useful to save to filenames that do not contain spaces.

Basic reading and writing to file (ASCII text only)

The built-in Python function open provides a way to open files and make them ready for reading their content or writing to them.

The open function requires the name of the file to open and the "mode", i.e., whether to open the file for reading (mode "r"), writing (mode "w"), or appending (mode "a"). It returns a file object.

Reading

Suppose you have a plain text file called mydata.txt in your current directory containing some numerical data (this will be used in further examples):

# x y
1 9.8
2 10.3
3 12.4
4 13.2
5 14.7
6 16.1
7 17.2
8 18.7
9 20.1
10 21.3

The first line starts with a # and is a comment line giving the "names" of the columns. The two columns can be read into two lists using the following code:

fp = open("mydata.txt", "r")  # open mydata.txt for "r"eading
x = []  # empty list to contain x data
y = []  # empty list to contain y data
for line in fp.readlines():  # loop through each line in the data
    if line[0] == "#":
        # skip lines that start with a "#"
        continue
    data = line.split()  # split the line on any whitespace
    x.append(float(data[0]))  # convert string into float
    y.append(float(data[1]))

fp.close()  # close the file

print(x)
print(y)
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
[9.8, 10.3, 12.4, 13.2, 14.7, 16.1, 17.2, 18.7, 20.1, 21.3]

In the above code, readlines is a method of the file object. It goes through the file and returns a list, where each entry is a string containing a line from the file. One can avoid reading in the entire file at once by simply using for line in fp:; this saves time and memory when files are very large.

When reading data in this way you must know what the data file looks like, i.e., you need to know that comment lines start with a # and that it contains two columns of numbers.

Note

Here, the file object variable has been named fp. I have used this as a hangover from writing C code where it is often used to mean "file pointer". Any variable name can be given to the file object.

The file object returned by open can be used as a context manager. This is a way of making sure that a resource, in this case the open file, is closed after use, even if an error occurs part-way through. Above, the file was explicitly closed using the close method, fp.close(), but you do not need to close the file if instead you use the with statement:

x = []  # empty list to contain x data
y = []  # empty list to contain y data
with open("mydata.txt", "r") as fp:
    # indent within the with statement
    for line in fp:  # loop through each line in the data
        if line[0] == "#":
            # skip lines that start with a "#"
            continue
        data = line.split()  # split the line on any whitespace
        x.append(float(data[0]))  # convert string into float
        y.append(float(data[1]))

# exited the with statement, but don't need to close fp

Reading binary data can be done by opening the file with the "rb" mode instead of "r" and reading the entire contents using the read method of the file object. However, converting the read-in data to something that is usable within Python is trickier as you have to know exactly the layout and memory size for the data stored within the file. We will not cover this here.

Writing

To write to a plain text file you again have to open a file, but this time with the mode set to write, "w". Once the file is open you can then use the write method of the file object to add data to the file. You can only write out string data, so any numbers must be converted to strings.

# create some data
datax = range(10, 21)
datay = [2.3 - 4.5 * x + 5.4 * x ** 2 for x in datax]

filename = "newfile.txt"

fp = open(filename, "w")
for i in range(len(datax)):  # loop over the data
    fp.write("{} {}\n".format(datax[i], datay[i]))

fp.close()  # close the file

After this program is run, the contents of file newfile.txt will look like:

10 497.3
11 606.2
12 725.9
13 856.4
14 997.7
15 1149.8
16 1312.7
17 1486.4
18 1670.9
19 1866.2
20 2072.3

The format string "{} {}\n" separates the two numbers with a single space and ends with the newline character \n. If the \n were omitted, all the numbers would be written on the same line. Values can instead be separated by multiple spaces, or by tabs using the tab character \t.
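As an illustration of changing the separator, the sketch below builds tab-separated lines and also fixes the number of decimal places with the {:.2f} format specifier. It only builds the strings that would be passed to write, and uses a shortened version of the data above.

```python
# a shortened version of the data used above
datax = range(10, 13)
datay = [2.3 - 4.5 * x + 5.4 * x ** 2 for x in datax]

lines = []
for x, y in zip(datax, datay):
    # tab-separated values, with y written to two decimal places
    lines.append("{}\t{:.2f}\n".format(x, y))

# each entry in lines could be passed to fp.write
print("".join(lines), end="")
```

Fixing the number of decimal places also makes the output independent of how Python chooses to display a particular float.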

Warning

If you open a file for writing that already exists it will overwrite the existing file and its contents will be gone. If you want to make sure that the file does not exist before writing you can do something like:

import os  # import the built-in os module

filename = "newfile.txt"  # name of file to write to
if os.path.isfile(filename):
    print("Warning: you are trying to write to an existing file.")
else:
    fp = open(filename, "w")
    ...
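An alternative sketch, not part of the approach above, is to open the file with the exclusive creation mode "x". In this mode open raises a FileExistsError if the file already exists, so the check and the opening happen in one step. A temporary directory is used here purely so that the example does not touch any real files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "newfile.txt")

    # the file does not exist yet, so mode "x" creates it
    with open(filename, "x") as fp:
        fp.write("first version\n")

    # the file now exists, so opening with mode "x" fails
    try:
        with open(filename, "x") as fp:
            fp.write("second version\n")
        overwritten = True
    except FileExistsError:
        overwritten = False
        print("Warning: you are trying to write to an existing file.")

    # the original contents are untouched
    with open(filename, "r") as fp:
        contents = fp.read()

print(contents, end="")
```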

Instead of the write method, the built-in print function can also be used to write to a file. The above code could be replicated with:

filename = "newfile.txt"

fp = open(filename, "w")
for i in range(len(datax)):  # loop over the data
    print(datax[i], datay[i], file=fp)

fp.close()  # close the file

When using print it automatically converts datax[i] and datay[i] to their string representations, automatically adds a space separating them (the separator can be altered using the sep keyword argument), and automatically adds a newline character.
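A short sketch of the sep keyword argument, here used to write comma-separated values. It combines print with the with statement described earlier, formats the y values to one decimal place, and writes into a temporary directory so that no existing file is affected.

```python
import os
import tempfile

# a shortened version of the data used above
datax = range(10, 13)
datay = [2.3 - 4.5 * x + 5.4 * x ** 2 for x in datax]

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "newfile.txt")

    with open(filename, "w") as fp:
        for x, y in zip(datax, datay):
            # sep="," replaces the default single space separator
            print(x, "{:.1f}".format(y), sep=",", file=fp)

    # read the file back to check what was written
    with open(filename, "r") as fp:
        written = fp.read()

print(written, end="")
```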

Appending

Instead of writing to an entirely new file you can append data to an existing file. To do this you would open the file with the append mode "a". If the file does not already exist a new file will be created. If the file does exist anything you write to it will be added to the end.
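A minimal sketch of appending, using a temporary directory so that the example is self-contained: the file is first created in write mode and then reopened in append mode, after which both lines are present.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "mydata.txt")

    # create the file with a single line of data
    with open(filename, "w") as fp:
        fp.write("1 9.8\n")

    # reopen in append mode: the new line is added to the end
    with open(filename, "a") as fp:
        fp.write("2 10.3\n")

    with open(filename, "r") as fp:
        lines = fp.readlines()

print(lines)
```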

Using pathlib

The built-in pathlib module provides a useful Path object for defining file or directory paths, rather than using strings.

The Path object has methods for reading and writing from and to text files. For example, to read the mydata.txt file defined above you could do:

from pathlib import Path

p = Path("mydata.txt")

# read all the contents of the file
contents = p.read_text()

This would read in all the file contents to a string variable, so it would still have to be parsed in some way, e.g.:

x = []
y = []
for line in contents.splitlines():  # loop through each line in the data
    if not line.strip() or line[0] == "#":
        # skip blank lines and lines that start with a "#"
        continue
    data = line.split()  # split the line on any whitespace
    x.append(float(data[0]))  # convert string into float
    y.append(float(data[1]))

A Path object can also just be passed to the open function, or other IO-functions such as those in NumPy, as if it were a string, e.g.,

from pathlib import Path

p = Path("myfile.txt")
with open(p, "w") as fp:
    ...
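The writing counterpart of read_text is the write_text method. The sketch below writes a small file and reads it back; it uses a temporary directory and made-up file contents purely for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    p = Path(tmpdir) / "myfile.txt"

    # write_text opens the file, writes the string and closes the file
    p.write_text("# x y\n1 9.8\n2 10.3\n")
    exists = p.exists()

    # read the contents back, skipping blank and comment lines
    rows = [
        line.split()
        for line in p.read_text().splitlines()
        if line.strip() and not line.startswith("#")
    ]

print(exists, rows)
```

Like opening a file in "w" mode, write_text replaces any existing contents of the file.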

Pickling

If your data is purely numerical then writing to a plain text file is a simple way to store it. However, you may want to save Python objects instead. There is a built-in Python module called pickle that allows (most) objects to be saved in a binary file.

If you define a simple class like:

class MyData:
    def __init__(self, x, y, name):
        self.x = x
        self.y = y
        self.name = name

    def __str__(self):
        return "MyData: '{}'\n x: {}\n y: {}".format(self.name, self.x, self.y)

and then create an instance of that class:

x = [0.7, 0.8, 0.9, 1.0, 1.1]
y = [-10, -9, -8, -7, -6]
mydata = MyData(x, y, "Lab1")

it can be saved in a pickle file using the dump function:

import pickle

filename = "mydata.pkl"
fp = open(filename, "wb")  # open writing to binary file
pickle.dump(mydata, fp)
fp.close()

This data can then be read back in using the load function, e.g.,

# read in the MyData object
filename = "mydata.pkl"
fp = open(filename, "rb")  # open reading from binary file
data = pickle.load(fp)
fp.close()  # close the file
print(data)
MyData: 'Lab1'
 x: [0.7, 0.8, 0.9, 1.0, 1.1]
 y: [-10, -9, -8, -7, -6]

Note

To load a pickled object, the object's class must be available, i.e., defined in the script that is loading the file or in an importable module, so that the object can be reconstructed.
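Built-in types such as dictionaries and lists can be pickled in the same way without any class definition being needed. The following sketch does the round trip using with statements so that the files are closed automatically; the dictionary contents and the temporary directory are illustrative only.

```python
import os
import pickle
import tempfile

# a dictionary holding only built-in types
record = {"name": "Lab1", "x": [0.7, 0.8], "y": [-10, -9]}

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "record.pkl")

    # "wb" opens the file for writing binary data
    with open(filename, "wb") as fp:
        pickle.dump(record, fp)

    # "rb" opens the file for reading binary data
    with open(filename, "rb") as fp:
        restored = pickle.load(fp)

print(restored == record)
```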

JSON

A plain text file format that allows you to save additional metadata is JSON. A JSON file typically has the structure of a dictionary object, so data (numbers, lists, strings, or even further dictionaries) stored in a dictionary can be output as a JSON file. Dictionaries have keys and values; the keys can therefore provide information (metadata) about the data that is stored.

The data can be written to a text file using the dump function from the built-in json Python module:

import json  # import json module

# create dictionary to store data
data = {}
data["x values"] = [0.7, 0.8, 0.9, 1.0, 1.1]
data["y values"] = [-10, -9, -8, -7, -6]
data["name"] = "Lab1"

# open file for writing
filename = "mydata.txt"
fp = open(filename, "w")

# write json file
json.dump(data, fp, indent=2)  # "indent=2" indents each line by 2 spaces

fp.close()

The output file is human readable and in this case contains:

{
  "x values": [
    0.7,
    0.8,
    0.9,
    1.0,
    1.1
  ],
  "y values": [
    -10,
    -9,
    -8,
    -7,
    -6
  ],
  "name": "Lab1"
}

A JSON file can be read back in using the load function, e.g.,:

import json

# open file for reading
filename = "mydata.txt"
with open(filename, "r") as fp:
    data = json.load(fp)
print(data)
{'x values': [0.7, 0.8, 0.9, 1.0, 1.1], 'y values': [-10, -9, -8, -7, -6], 'name': 'Lab1'}
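JSON only has a small set of data types, so a round trip does not always return exactly what was saved. As a sketch, the example below uses the dumps and loads functions (which work with strings rather than files) to show that a tuple comes back as a list and a non-string key comes back as a string. The dictionary contents are made up for illustration.

```python
import json

# a tuple value and an integer key, neither of which exists in JSON
data = {"name": "Lab1", "point": (0.7, -10), 1: "run number"}

text = json.dumps(data)  # convert to a JSON string
restored = json.loads(text)  # convert back to a dictionary

print(text)
print(restored)
```

If the exact types matter, they need to be converted back by hand after loading.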

NumPy

NumPy can be used to both save and read plain text data, or pickle objects in binary files.

Reading and writing text files

Considering our original mydata.txt file, this could be read in as a NumPy ndarray using the loadtxt function, e.g.,:

import numpy as np

filename = "mydata.txt"
data = np.loadtxt(filename, comments="#")
print(data)
[[ 1.   9.8]
 [ 2.  10.3]
 [ 3.  12.4]
 [ 4.  13.2]
 [ 5.  14.7]
 [ 6.  16.1]
 [ 7.  17.2]
 [ 8.  18.7]
 [ 9.  20.1]
 [10.  21.3]]

Rather than taking a file object, loadtxt can just be passed the file name. Lines starting with a particular character, in this case #, can be ignored using the comments keyword argument.

If the data file contained columns with values separated by commas (often called comma separated value, or CSV, files), then the delimiter keyword argument could be used, e.g., data = np.loadtxt(filename, comments="#", delimiter=",").
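A brief sketch of reading comma-separated data is given below. It assumes NumPy is installed, and passes loadtxt an in-memory io.StringIO object in place of a file name so that no file is needed. The unpack keyword argument, which is not used above, returns each column as a separate array.

```python
import io

import numpy as np

# comma-separated data held in memory instead of in a file
csv_text = "# x,y\n1,9.8\n2,10.3\n3,12.4\n"

# unpack=True returns one array per column
x, y = np.loadtxt(io.StringIO(csv_text), comments="#", delimiter=",", unpack=True)

print(x)
print(y)
```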

More control, including converting particular columns to certain data types, is available, with the finest grain of control found using the genfromtxt function.

1D and 2D NumPy arrays can be saved to text files using the savetxt function, e.g.,:

import numpy as np

# data to save
data = np.array([[0.1, 10.0], [0.2, 11.0], [0.3, 12.0], [0.4, 13.0]])

filename = "mydata.txt"
np.savetxt(filename, data)

The output format (i.e., the number of decimal places on float numbers) can be set using the fmt keyword argument, e.g., np.savetxt(filename, data, fmt="%.5f") would output floats with 5 decimal places. The delimiter between the output values can be set (by default a space), and header and footer text, preceded by a comment character, can also be set.
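The following sketch combines the fmt, delimiter and header keyword arguments and then reads the file back in. It assumes NumPy is installed and writes to a temporary directory so that no existing file is overwritten.

```python
import os
import tempfile

import numpy as np

data = np.array([[0.1, 10.0], [0.2, 11.0]])

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "mydata.txt")

    # two decimal places, comma separated, with a header line
    np.savetxt(filename, data, fmt="%.2f", delimiter=",", header="x,y")

    # look at the raw text that was written
    with open(filename, "r") as fp:
        written = fp.read()

    # the header starts with "#", so loadtxt skips it by default
    reread = np.loadtxt(filename, delimiter=",")

print(written, end="")
print(reread)
```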

Binary files

NumPy arrays can be saved as binary files (in NumPy's own .npy format) using the save function and then read back in using the load function, e.g.,:

import numpy as np

# our data
data = np.array([[0.1, 10.0], [0.2, 11.0], [0.3, 12.0], [0.4, 13.0]])

# save the data
filename = "mydata.npy"  # npy is the standard file extension name
np.save(filename, data)

# read in the data
newdata = np.load(filename)
print(newdata)
[[ 0.1 10. ]
 [ 0.2 11. ]
 [ 0.3 12. ]
 [ 0.4 13. ]]

By default, the save function assumes you are saving a NumPy array object. However, you can get it to save other Python objects by explicitly telling it to allow pickling. For example, if you had a simple class like:

class DataPoint:
    def __init__(self, x, y, z):
        # store copies of x, y and z in the class
        self.x = x
        self.y = y
        self.z = z

and generated a list containing many DataPoint objects:

import numpy as np

mydata = [DataPoint(x, y, np.sqrt(x ** 2 + y ** 2)) for x, y in np.random.rand(10, 2)]

then to save this you would use:

filename = "mydata.npy"
np.save(filename, mydata, allow_pickle=True)

If you were to then read this data in, again allowing pickling (which load disables by default):

readdata = np.load(filename, allow_pickle=True)

then readdata will contain the original DataPoint objects.

Note

The load function will always load the data as a NumPy array object. So, in the example above, while mydata was a list, readdata will instead be a NumPy array, although it will still contain the DataPoint objects.
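If a list is needed again, the loaded array can be converted back with its tolist method. A self-contained sketch, assuming NumPy is installed and using a cut-down, hypothetical two-attribute version of the DataPoint class:

```python
import os
import tempfile

import numpy as np


class DataPoint:
    # a cut-down version of the class above, for illustration only
    def __init__(self, x, y):
        self.x = x
        self.y = y


points = [DataPoint(1.0, 2.0), DataPoint(3.0, 4.0)]

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "points.npy")
    np.save(filename, points, allow_pickle=True)
    readdata = np.load(filename, allow_pickle=True)

# readdata is a NumPy array of objects; tolist gives a plain list
aslist = readdata.tolist()

print(type(readdata), type(aslist))
print(aslist[1].x, aslist[1].y)
```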

Pandas

While not covered in this tutorial, Pandas is an advanced Python module primarily for holding data tables. It has methods for reading and writing to a variety of file formats including plain text, CSV and Excel spreadsheets.