General parsing of data files with Python 3
At Evergreen we regularly undertake projects in which we need to analyse data generated by the bespoke acquisition systems we build for our clients. This could be, for example, data relating to the control system of a model wave energy converter.
In this post we will look at some of the tricks we have picked up over the course of many testing campaigns. All of our analysis routines are implemented in Python 3.
The easiest approach is simply to use numpy to load the contents of a data file into a matrix from which we can pick out the columns of data we want to use.
Suppose our data acquisition system records the elapsed time and a velocity signal. We would then have a tab-separated file that looks like:
```
Time (s)	velocity (mm/s)
```
The simplest means of reading the file is to use the numpy `loadtxt` function to create a matrix with the time in the first column and the velocity in the second:
```python
import numpy as np
import matplotlib.pyplot as plt

fname = 'example.txt'
data = np.loadtxt(fname, skiprows=1)

plt.figure()
plt.plot(data[:, 0], data[:, 1])
plt.xlabel('t (s)')
plt.ylabel('vel (mm/s)')
plt.grid()
plt.savefig('img1.png', format='png', dpi=500)
plt.show()
```
The resulting figure is as expected:
For files with a handful of columns, this simple method is effective. However, numerical indexing of the columns (`data[:,0]` etc.) has several drawbacks:

1. Our files often have as many as 30 columns, which makes identification by number error-prone.
2. Each time we want to parse a file we have a long list picking out the columns we want to use – a red flag for copy-paste errors.
3. During an experimental series we sometimes add or remove columns: for the first two days of testing column 5 could be velocity, but column 6 thereafter. Writing code to handle these cases automatically is time-consuming.
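One lightweight way to see the problem is to build a name-to-index mapping from the header line, so at least the column numbers are not hard-coded. This is our own sketch (the file name and its contents are invented for the illustration):

```python
import numpy as np

# A small example file (contents invented for the illustration)
with open('example.txt', 'w') as f:
    f.write('Time (s)\tvelocity (mm/s)\n0.0\t1.5\n0.1\t2.5\n')

# Map each column title to its position in the file
with open('example.txt') as f:
    header = f.readline().rstrip().split('\t')
col = {name: i for i, name in enumerate(header)}

data = np.loadtxt('example.txt', skiprows=1)
velocity = data[:, col['velocity (mm/s)']]
print(velocity)  # [1.5 2.5]
```

This removes the magic numbers, but the string lookups are still clumsy; the approach described next is tidier.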
The best strategy we have found is to use the header line of the file to name each column in the parsed output. For example, we would like to write `data.velocity_mm_s` and not have to worry about where in the file these values actually are. To achieve this, we can use the `fromrecords` function in numpy along with the `loadtxt` function we have already seen. As we want to use the header line for identification, it is helpful to make that line easier to read automatically – the header of our files now reads `time_s	velocity_mm_s` (tab-separated, with the units folded into the names).
We have found that keeping the units in the names also helps to avoid mistakes, particularly when working between projects that have different conventions. Again, this is something that can cause hard-to-find bugs when using the ‘simple’ scheme above.
To parse the file, we start by reading and storing the header line, then load the contents, and finally associate each column with its title:
```python
def parse_file(fname, print_names=False):
    '''Parse a file using the header line as the column names.

    Inputs:
        fname - filename to parse
        print_names - print the column headers

    Outputs:
        data - file contents as a record array
    '''
    # Read the header line into column names
    with open(fname) as f:
        cols = f.readline().rstrip().split('\t')
    if print_names:
        for c in cols:
            print(c)

    # Parse the data in the file
    raw = np.loadtxt(fname, skiprows=1)

    # Associate each column with its title
    data = np.core.records.fromrecords(raw, names=cols)
    return data
```
The `print_names` argument is there so we can see what our column names are without having to open the original file; it is made optional to reduce unnecessary console output once our analysis scripts are established. You can, of course, manipulate the column titles in `parse_file` to suit the output you want.
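If your acquisition software writes headers such as `Time (s)` rather than identifier-friendly names, one option is to sanitise the titles before passing them to `fromrecords`. The `sanitise` helper below is our own illustration, not part of numpy:

```python
import re

def sanitise(title):
    '''Convert a raw column title such as 'Time (s)' into a
    valid Python identifier such as 'time_s'.'''
    # Lower-case, replace runs of non-alphanumeric characters
    # with single underscores, and trim stray underscores
    return re.sub(r'[^0-9a-z]+', '_', title.lower()).strip('_')

print(sanitise('Time (s)'))          # time_s
print(sanitise('velocity (mm/s)'))   # velocity_mm_s
```

Calling this on each entry of `cols` inside `parse_file` keeps the units in the names while still producing attribute-friendly titles.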
Pulling everything together, we have
```python
fname = 'general_example.txt'
data = parse_file(fname, print_names=True)

plt.figure()
plt.plot(data.time_s, data.velocity_mm_s)
plt.xlabel('t (s)')
plt.ylabel('velocity (mm/s)')
plt.grid()
plt.show()
```
which gives the same result as before.
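As a quick check of the pay-off from the drawbacks listed earlier – columns that move position between test days – the self-contained sketch below (file names and values invented for the illustration) shows that the same named access works wherever the velocity column sits. `np.rec.fromrecords` is the public alias of the `np.core.records.fromrecords` call used above:

```python
import numpy as np

def parse_file(fname):
    '''Minimal version of the parser described above:
    use the header line to name the columns.'''
    with open(fname) as f:
        cols = f.readline().rstrip().split('\t')
    raw = np.loadtxt(fname, skiprows=1)
    return np.rec.fromrecords(raw, names=cols)

# Two files with the same columns in a different order
# (contents invented for the illustration)
with open('day1.txt', 'w') as f:
    f.write('time_s\tvelocity_mm_s\n0.0\t1.5\n0.1\t2.5\n')
with open('day2.txt', 'w') as f:
    f.write('velocity_mm_s\ttime_s\n1.5\t0.0\n2.5\t0.1\n')

# Named access gives the same signal regardless of position
v1 = parse_file('day1.txt').velocity_mm_s
v2 = parse_file('day2.txt').velocity_mm_s
print(np.allclose(v1, v2))  # True
```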
In this post, we have taken advantage of routines already implemented in numpy to make dealing with our data files easier. At Evergreen we have found this technique robust, and it has certainly saved us time tracing bugs related to identifying the correct data within large files.