Tech Blog

Python Parsing


General parsing of data files with Python 3

Overview

At Evergreen we regularly undertake projects in which we need to analyse data generated by the bespoke acquisition systems we build for our clients. This could be, for example, data related to the control system of a model wave energy converter.

In this post we will look at some of the tricks we have picked up over the course of many testing campaigns. All of our analysis routines are implemented in Python 3.

Simple example

The easiest approach is simply to use numpy to load the contents of a data file into a matrix from which we can pick out the columns of data we want to use.

Suppose our data acquisition system records the elapsed time and a velocity signal. We would have a tab-separated file that looks like:

Time (s)    velocity (mm/s)
0.0000      14.794255
0.0078      14.930482
0.0156      15.064894
0.0234      15.197459
0.0312      15.328144
…           …
59.9609     -1.005011
59.9688     -0.905971
59.9766     -0.808469
59.9844     -0.712527
59.9922     -0.618163

The simplest means of reading the file is using the numpy loadtxt function to create a matrix with the time in the first column and the velocity in the second column:

import numpy as np
import matplotlib.pyplot as plt

fname = 'example.txt'

data = np.loadtxt(fname, skiprows=1)

plt.figure()
plt.plot(data[:,0], data[:,1])
plt.xlabel('t (s)')
plt.ylabel('vel (mm/s)')
plt.grid()
plt.savefig('img1.png', format='png', dpi=500)
plt.show()

The resulting figure is as expected:

Time history of velocity

General Example

For files with a handful of columns, the simple method is effective. Numerical indexing of the columns (data[:,0] etc.) has several drawbacks, however:

1. Our files often have as many as 30 columns, which makes identification by number error-prone.
2. Each time we want to parse a file we need a long list of statements picking out the columns we want to use – this is a red flag for copy-paste errors.
3. During an experimental series we sometimes add or remove columns: for the first two days of testing, column 5 could be velocity, but column 6 thereafter. Writing code to automatically take care of these cases is time-consuming.

The best strategy we have found is to use the header line of the file to name each column in parsed output. For example, we would like to write:

plot(data.time, data.velocity)

and not have to worry about where in the file these values actually are. To achieve this, we can use the fromrecords function in numpy along with the loadtxt function we have already seen. As we want to use the header line for identification, it is helpful to make the line easier to read automatically – our files now look like:

time_s      velocity_mm_s
0.0000      14.794255
0.0078      14.930482
0.0156      15.064894
0.0234      15.197459
0.0312      15.328144
…           …
59.9609     -1.005011
59.9688     -0.905971
59.9766     -0.808469
59.9844     -0.712527
59.9922     -0.618163

We have found that leaving the units in the names also helps to avoid mistakes, particularly when working between projects that have different conventions. Again, this is something that can cause hard-to-find bugs when using the 'simple' scheme above.
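If the headers in the raw files do not already follow this underscore convention, a small helper can derive it automatically. The sketch below is one way to do it (normalize_header is a hypothetical name of ours, not part of numpy):

```python
import re

def normalize_header(name):
    '''Turn a raw column title such as "Time (s)" into a
    valid attribute name such as "time_s".'''
    # Lower-case, then collapse every run of non-alphanumeric
    # characters into a single underscore
    name = re.sub(r'[^0-9a-z]+', '_', name.lower())
    # Drop any underscore left over from trailing punctuation
    return name.strip('_')

print(normalize_header('Time (s)'))         # time_s
print(normalize_header('velocity (mm/s)'))  # velocity_mm_s
```

Running the raw titles through such a helper at parse time means the acquisition system itself never needs to change.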

To parse the file, we start by reading and storing the header line, then load the contents and associate each column with its title.

def parse_file(fname, print_names=False):
    '''Parse a file using the header line as the column names.
    
    Inputs:
    fname - filename to parse
    print_names - print the column headers
    
    Outputs:
    data - file contents
    '''

    # Read the header line into columns
    with open(fname) as f:
        cols = f.readline().rstrip().split('\t')
        
    if print_names:
        for c in cols:
            print(c)
        
    # Parse the data in the file
    raw = np.loadtxt(fname, skiprows=1)
    
    # Associate each column with its title
    data = np.rec.fromrecords(raw, names=cols)
    
    return data

The print_names argument to the function is there so we can see what our column names are without having to open the original file; it is made optional to reduce unnecessary console output once our analysis scripts are established. You can, of course, manipulate the column titles in parse_file to suit the output you want.
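One such manipulation would be stripping the unit suffixes from the titles, so that the plotting code can read data.time and data.velocity as in the earlier aspiration. A minimal sketch, assuming the suffix convention above (strip_units and the suffix list are hypothetical, project-specific choices):

```python
# Assumed project-specific unit suffixes; longer suffixes must be
# checked first, otherwise '_s' would truncate 'velocity_mm_s'
SUFFIXES = ['_mm_s', '_deg', '_s']

def strip_units(col):
    '''Remove a trailing unit suffix from a column title.'''
    for suffix in SUFFIXES:
        if col.endswith(suffix):
            return col[:-len(suffix)]
    return col

print(strip_units('time_s'))         # time
print(strip_units('velocity_mm_s'))  # velocity
```

Applying this inside parse_file (cols = [strip_units(c) for c in cols]) trades the unit information for shorter names, so it is best reserved for projects with a single, well-understood unit convention.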

Pulling everything together, we have

fname = 'general_example.txt'
data = parse_file(fname, print_names=True)

plt.figure()
plt.plot(data.time_s, data.velocity_mm_s)
plt.xlabel('t (s)')
plt.ylabel('velocity (mm/s)')
plt.grid()
plt.show()

which gives the same result as before.
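As an aside, numpy's genfromtxt function can achieve a similar header-based parse in a single call, although it returns a plain structured array, indexed by name with brackets, rather than the attribute-style record array produced by fromrecords. A minimal sketch, using a short in-memory stand-in for our file:

```python
import io

import numpy as np

# A short in-memory stand-in for the tab-separated file above
raw = ('time_s\tvelocity_mm_s\n'
       '0.0000\t14.794255\n'
       '0.0078\t14.930482\n')

# names=True takes the column names from the header line
data = np.genfromtxt(io.StringIO(raw), names=True)

# Columns are addressed with dictionary-style indexing
print(data['velocity_mm_s'][0])  # 14.794255
```

Which to prefer is largely a matter of taste; we find the attribute access of the record array reads more naturally in long analysis scripts.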

Summing up

In this post, we have taken advantage of routines already implemented in numpy to make dealing with our data files easier. At Evergreen we have found the general technique robust, and it has certainly saved us time tracing bugs related to identifying the correct data within large files.