An upcoming C++ library for HDF5 files

I recently started working on a new C++ library for reading and writing HDF5 files. I got the idea when I was working on a few files in Python and C++ at the same time. The Python library h5py is just way more comfortable than the HDF5 C++ API. But with the new features in C++11 and C++14, I figured it should be possible to make a C++ library that is just as easy to use as h5py. And I must admit that I believe I’m on the way to making it even easier.

One design goal I have set for this project is to make the library forgiving. This means that I assume you know what you are doing and try to make it possible, even though there might be some side effects.

One example of such a side effect is creating a dataset in an HDF5 file and later changing its size. Because of limitations in the HDF5 standard, the original dataset will still take up space in the file, although it’s no longer in use. I believe the HDF Group designed it this way because it is hard to free already allocated space in a file without truncating it first. So they optimized for performance rather than flexibility, which is a fair choice to make. However, I assume you want flexibility and that you’d rather see that extra space taken up than see your program crash for trying to reuse a dataset. So this code is perfectly fine within my library:

#include <armadillo>
#include <elegant/hdf5>

using namespace elegant::hdf5;
using namespace arma;

int main() {
    File myFile("myfile.h5");
    myFile["my_dataset"] = zeros(10, 15);
    myFile["my_dataset"] = zeros(20, 25);
    return 0;
}

This opens myfile.h5 for reading and writing and sets my_dataset to a 10×15 matrix of zeros before resetting it to a 20×25 matrix. The space taken up by the 10×15 matrix will, however, remain allocated in the file, even though the matrix is no longer accessible. Oh, and did I mention that the Armadillo library is already supported?

In contrast, if you try to do the same in h5py, you will get a RuntimeError:

from h5py import *
from pylab import *
my_file = File("myfile.h5")
my_file["my_dataset"] = zeros((10, 15))
my_file["my_dataset"] = zeros((20, 25))
# output:
...
RuntimeError: Unable to create link (Name already exists)

As you can see, the syntax is pretty much the same, but the C++ library will be slightly more forgiving than h5py. That, in addition to a range of other nice features, is what I hope will make this new HDF5 C++ library attractive.
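For comparison, the usual way around this in h5py is to unlink the old dataset before assigning the new one. Below is a minimal sketch of that idea. The helper name replace_dataset is made up for this post, and a plain dict stands in for the open file so the snippet runs without HDF5 installed. As far as I know, deleting the link does not shrink the file either, so the same unused space is left behind, and the exact exception type you get without the delete may vary between h5py versions.

```python
def replace_dataset(group, name, data):
    """Store data under name, unlinking any existing entry first.

    Hypothetical helper: works on anything that supports `in`, `del`
    and item assignment, which includes an open h5py.File or Group.
    """
    if name in group:
        # Removes the link only; the old data is merely unreachable.
        del group[name]
    group[name] = data


# A plain dict stands in for an open h5py.File here.
# With h5py, the call would look something like:
#     with h5py.File("myfile.h5", "a") as my_file:
#         replace_dataset(my_file, "my_dataset", zeros((20, 25)))
store = {}
replace_dataset(store, "my_dataset", [[0.0] * 15 for _ in range(10)])
replace_dataset(store, "my_dataset", [[0.0] * 25 for _ in range(20)])
print(len(store["my_dataset"]), len(store["my_dataset"][0]))  # 20 25
```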

An experimental version will hopefully be released soon.

Installing Sumatra 0.6-dev in Ubuntu 13.10

Sumatra is a great tool for reproducibility and provenance when you are running numerical simulations. Keeping your work reproducible ensures that you and others will be able to check your results at a later time, while provenance means keeping track of where your results came from in the first place.

I won’t get into details about using Sumatra in this post, as its documentation is quite good at describing that already.

Computational Physics members

If you are a member of the Computational Physics group and have access to the computing cluster, Smaug, you don’t have to install anything. Just ssh into our dedicated Sumatra machine, named bender, and run your jobs from there. In the future, all machines will support Sumatra, but for now, we only have one dedicated machine for this task:

ssh bender

On your own machine

Otherwise, or if you want it on your own machine, you will have to install it manually. This is done by cloning the repository and running the setup.py file. This will install Sumatra with all dependencies:

hg clone https://bitbucket.org/apdavison/sumatra
cd sumatra
python setup.py install --user

Adding --user to the final command ensures that all packages are installed to your home directory, under .local/lib, so that you don’t have to install with sudo privileges. It also makes it a bit easier to remove the package if you don’t want to keep it.

We also need the GitPython and PyYAML modules, which you may install using apt-get:

sudo apt-get install python-git python-yaml

And that’s it! You should now be able to run your projects with Sumatra.

Monitoring your unit tests without lifting a finger

I love unit testing. First of all, I think it is a good idea to test separate units of the code, but after doing so for some time, I’ve come to realize that unit tests are great for managing the software development cycle too. It all boils down to the idea that you should write tests before you write your code.

Now, this is something that I and others apparently struggle a lot with. How do you write a test for some code that doesn’t even exist yet? Even worse, how do you write a test for a piece of software when you’re not yet sure how it will be used?

In computational physics, this problem arises often because we are writing code at the same time as we are trying to understand the physics, mathematics and algorithms at hand. And this is a good thing. You might think that one should structure all code before it is written, but this is generally a bad approach in computational physics, especially if you’re working on something new. The reason is that you will often understand the problem and algorithms better while developing than by just reading about them and trying to analyze them blindly.

Keeping the tests and code healthy

But enough with the talk, let’s just assume that you are convinced that you should (or have to) implement some unit tests. At some point you are likely to find it tiresome to go into the folder where the tests are defined and run them manually. This is where Jenkins comes into play.


Working with percolation clusters in Python

We’re working on a new project in FYS4460 about percolation. In the introduction of this project, we are given a few commands to help us demonstrate a few properties of percolation clusters using MATLAB.

Being the Python fan I am, I of course had to see if I could find equivalent commands in Python, and thankfully that was quite easy. Below I will summarize the commands that will generate a random matrix of filled and unfilled areas, label each cluster in this matrix and calculate the area of each such cluster. Finally, we’ll draw a bounding box around the largest cluster.
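To give a rough idea of what those steps involve, here is a dependency-free sketch. It is not the set of library commands the post itself summarizes. The function names are made up for illustration, and I assume that sites belong to the same cluster when they are nearest neighbours (up, down, left, right).

```python
import random
from collections import deque


def random_lattice(size, p, seed=None):
    """Return a size x size grid where each site is filled with probability p."""
    rng = random.Random(seed)
    return [[rng.random() < p for _ in range(size)] for _ in range(size)]


def label_clusters(lattice):
    """Label 4-connected clusters with a breadth-first flood fill.

    Returns (labels, areas, boxes): labels is a grid of cluster ids
    (0 = unfilled), areas maps id -> number of sites, and boxes maps
    id -> (row_min, col_min, row_max, col_max).
    """
    rows, cols = len(lattice), len(lattice[0])
    labels = [[0] * cols for _ in range(rows)]
    areas, boxes = {}, {}
    current = 0
    for r in range(rows):
        for c in range(cols):
            if not lattice[r][c] or labels[r][c]:
                continue  # unfilled, or already part of a labelled cluster
            current += 1
            labels[r][c] = current
            queue = deque([(r, c)])
            area = 0
            rmin, rmax, cmin, cmax = r, r, c, c
            while queue:
                y, x = queue.popleft()
                area += 1
                rmin, rmax = min(rmin, y), max(rmax, y)
                cmin, cmax = min(cmin, x), max(cmax, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and lattice[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = current
                        queue.append((ny, nx))
            areas[current] = area
            boxes[current] = (rmin, cmin, rmax, cmax)
    return labels, areas, boxes


# Small hand-made lattice so the result is easy to check by eye.
demo = [
    [True, True, False],
    [False, True, False],
    [False, False, True],
]
labels, areas, boxes = label_clusters(demo)
largest = max(areas, key=areas.get)
print(areas[largest], boxes[largest])  # 3 (0, 0, 1, 1)
```

For a real run you would start from something like random_lattice(100, 0.6) instead of the hand-made grid.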


Adjusting to the new version of Pylab and Mayavi on Ubuntu 12.04

It seems the IPython and Pylab packages have also been updated in 12.04, which removed the old ipython -wthread flag that ensured Mayavi plots ran in a separate thread. Running with the flag causes this error to show up:

[TerminalIPythonApp] Unrecognized flag: '-wthread'

Without this flag, the Mayavi plots lock up the UI and hang. If you want to be able to rotate and play around with the plots again, just start IPython the following way from now on:

ipython --pylab=qt

This will launch IPython with the Qt backend and threading. Using only --pylab does not include threading. For easy and quick access, add the following to a file named .bashrc in your home folder:

alias pylab='ipython --pylab=qt'

From now on you can launch IPython just by typing

pylab

in a terminal.

Using the same script on installs with different EPD versions

In the newest version of Enthought’s Python Distribution (EPD) on Ubuntu, the plotting package has been moved from enthought.mayavi.mlab to the shorter and more general mayavi.mlab. This does however mean that if you, like me, need to work with different versions of EPD on multiple systems, you will experience the following error from time to time:

ImportError: No module named enthought.mayavi.mlab

Now, to avoid switching the import statement every time you switch systems, you can make Python check if one of the versions is installed during import. If it is not, we’ll tell it to try the other. This is done with a simple try/except block:

try:
    from enthought.mayavi.mlab import *
except ImportError:
    from mayavi.mlab import *

Just replace any other similar import statements the same way and your code should once again be working across all your installations.
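If you have many such imports, the same idea can be wrapped in a small helper. This is only a sketch: import_first is a name I made up, and it returns the module object rather than doing a star import, so you would call functions as mlab.something afterwards. The demo uses a nonexistent name followed by a standard library module so it runs anywhere.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed here, try the next candidate
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(module_names)
    )


# For the case in this post, the call would be:
#     mlab = import_first("enthought.mayavi.mlab", "mayavi.mlab")
# Demonstrated with a made-up name and a standard library module instead:
json_module = import_first("no_such_module_hopefully", "json")
print(json_module.__name__)  # json
```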

Python deleted my vector values

Sometimes scripting languages can be a real annoyance. Why? Because when you get as much help as you do with for instance Python, you also lose a lot of control.

Being used to scripting languages like PHP, I made the funny mistake today of initializing a set of arrays like:

a = v = r = zeros((n,2),float)

This seemed like a really good idea, saving me from typing two extra lines(!). As the sucker I am for short code, I was happy with my newfound shortcut. What I didn’t realize is that Python, unlike PHP, treats a chained assignment like this as binding several names to one and the same object, rather than creating a separate variable for each name.

I believed this would create three arrays with a lot of zeros in two dimensions as I would expect from PHP, but the result was that I instead created one array with loads of zeros in two dimensions, with three pointers a, v and r all pointing to the same array.

When I then started setting the values for each of these arrays using Euler’s method, the result was that I got a lot of nonsense in what I thought were three separate arrays.

As a reminder to myself and everyone else out there: Python is not PHP. If you want to initialize three arrays like this in Python, you’ll have to stick to the long version:

a = zeros((n,2),float)
r = zeros((n,2),float)
v = zeros((n,2),float)

Or, you could at least save yourself from having to edit each assignment if you ever need to change the code by writing:

a = zeros((n,2),float)
r = a.copy()
v = a.copy()
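If you want to see the difference for yourself, here is a small check. I use plain Python lists so it runs without any extra packages, but the same name-binding rule applies to arrays created with zeros.

```python
n = 3

# Chained assignment: one list object, three names bound to it.
a = v = r = [0.0] * n
v[0] = 1.5
print(a is v, a[0])  # True 1.5

# Separate objects: build one, then copy it for the others.
a = [0.0] * n
r = a.copy()
v = a.copy()
v[0] = 1.5
print(a is v, a[0])  # False 0.0
```

The is operator tells you whether two names refer to the same object, which is exactly what went wrong in the first version.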