PKGW: rubbl_casatables 0.8

2024 August 14

Yesterday I put out the first release in the 0.8.x series of the rubbl_casatables Rust crate, which provides access to the “casatable” data container file format used by the CASA radio interferometry package (sometimes called the CASA Table Data System). CASA uses the casatable format for virtually all of its data files, most notably the MeasurementSets that store interferometric visibilities. This release is primarily the work of @d3v-null at Curtin, who undertook the difficult and tedious project of updating the backing codebase to use the casacore 3.5.0 implementation of the casatables format.

To understand the niche occupied by rubbl_casatables, it’s helpful to be careful about how we discuss CASA’s data files. In particular, the distinction between a MeasurementSet and a casatable is important.

When I discuss the casatable format, I’m referring to what I call a container format or sometimes “serialization format”. It defines how to store various kinds of data into files on disk, but is silent on the why: what the data actually mean. File archive formats like Zip are perhaps the most obvious examples of container formats: the Zip specification tells you how to pack various files into a Zip archive, but it doesn’t (and can’t, and shouldn’t) make any claims about the meaning of those files. Other file formats, like Java’s JAR files, actually use the Zip container format for their underlying storage, then apply additional layers of semantics. For instance, a JAR file should contain a Zip entry called META-INF/MANIFEST.MF whose contents and meaning are defined by the JAR specification.

The MeasurementSet specification, in contrast to casatables, is closer to the JAR format in that it works at a more semantic level, defining a way to represent interferometry data in particular. This representation can then be captured in a casatable, but it can also potentially be mapped to other serialization formats like Arrow, Zarr, or Parquet (see, e.g., the arcae Python package).
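To make the container-versus-semantics split concrete, here is a minimal sketch in Rust. The `Container` type and the `manifest_of` function are hypothetical names invented for illustration; they are not part of any Zip, JAR, or casatables library. The container layer only knows about named blobs of bytes, while the JAR-like rule about META-INF/MANIFEST.MF lives in a separate layer on top.

```rust
use std::collections::BTreeMap;

/// Container layer (hypothetical): named blobs of bytes, with no opinion
/// about what any of them mean.
#[derive(Default)]
struct Container {
    entries: BTreeMap<String, Vec<u8>>,
}

impl Container {
    /// Store a blob under a name, replacing any previous entry.
    fn put(&mut self, name: &str, data: &[u8]) {
        self.entries.insert(name.to_string(), data.to_vec());
    }

    /// Look up a blob by name.
    fn get(&self, name: &str) -> Option<&[u8]> {
        self.entries.get(name).map(|v| v.as_slice())
    }
}

/// Semantic layer (hypothetical): a JAR-like rule stacked on the container.
/// It insists on one specific entry and interprets its bytes as text.
fn manifest_of(c: &Container) -> Result<&str, String> {
    let raw = c
        .get("META-INF/MANIFEST.MF")
        .ok_or_else(|| "missing META-INF/MANIFEST.MF".to_string())?;
    std::str::from_utf8(raw).map_err(|e| format!("manifest is not UTF-8: {e}"))
}
```

The same `Container` could just as well sit underneath a completely different set of rules, which is the relationship between a casatable and a MeasurementSet.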

In astronomy, we often talk about FITS as an image format, and indeed the “I” in “FITS” does stand for “image”, but modern FITS is really a container format as well. After all, FITS files can contain multiple Header Data Units, each of which can contain an N-dimensional array, or a binary table, or interferometry data, or all sorts of other custom data structures. I would claim that the longevity of FITS stems in large part from its evolution from a pure image format to a more flexible container format with a very simple specification, which makes it easy to implement FITS readers and writers in a variety of languages. For instance, I also wrote a small rubbl_fits crate that exposes FITS I/O to Rust. It’s beta-quality at best, but I’m still pretty confident that it can correctly understand the structure of any valid FITS file that you throw at it.
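As a rough illustration of how little it takes to get started with FITS, here is a sketch of a header-card parser using only the Rust standard library. It relies on the basic layout of a FITS header: a sequence of 80-character ASCII “cards”, with the keyword in the first eight columns, `= ` in columns 9 and 10 when a value follows, and an `END` card closing the header. The names `Card`, `parse_card`, and `parse_header` are made up for this example and are not the rubbl_fits API; comment stripping is deliberately naive.

```rust
/// Length of one FITS header card, in bytes.
const CARD_LEN: usize = 80;

/// One parsed header card (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct Card {
    keyword: String,
    /// Raw value text, if the card carries the "= " value indicator.
    value: Option<String>,
}

/// Parse a single 80-byte card. Returns None if the input is not exactly
/// 80 bytes of ASCII.
fn parse_card(card: &[u8]) -> Option<Card> {
    if card.len() != CARD_LEN || !card.is_ascii() {
        return None;
    }
    let text = std::str::from_utf8(card).ok()?;
    let keyword = text[..8].trim_end().to_string();
    let value = if &text[8..10] == "= " {
        // Naive: drops everything after the first '/', so a '/' inside a
        // quoted string value would be mishandled.
        let body = text[10..].split('/').next().unwrap_or("");
        Some(body.trim().to_string())
    } else {
        None
    };
    Some(Card { keyword, value })
}

/// Collect cards from a header until the END card (or malformed input).
fn parse_header(bytes: &[u8]) -> Vec<Card> {
    let mut cards = Vec::new();
    for chunk in bytes.chunks_exact(CARD_LEN) {
        match parse_card(chunk) {
            Some(c) if c.keyword == "END" => break,
            Some(c) => cards.push(c),
            None => break,
        }
    }
    cards
}
```

A real reader has much more to do (2880-byte blocking, typed values, continuation conventions, the data units themselves), but the point is that the starting line is within reach of an afternoon’s work.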

The casatables format is another member of the astronomical data container family. In my view it’s most similar to HDF5: they’re both relatively “modern” formats for complex scientific data, designed as extensible container formats from the start.

The other thing that these two formats have in common — which honestly infuriates me in both cases — is that their byte-level serializations are extremely complex and at best weakly documented; effectively the only way to use these container formats is through extremely gnarly C++ libraries provided by the format developers. HDF5 does at least have a written specification of the on-disk format, while to the best of my knowledge there isn’t one at all for casatables. But in either case, if you wanted to implement a parser for the format from scratch in another programming language, I strongly suspect that you’d basically have to reverse-engineer the C++ codebase.

If you ask me, the cardinal virtue of a container format is to have a minimally-complex, clear specification that you can imagine implementing from scratch if needed. Once again, I think this is why FITS has lasted for so long. It’s not that I’m worried about literally losing the ability to compile the relevant implementation libraries, although C++ code in particular seems to need regular maintenance just to stay buildable: the language has an unfortunate combination of high complexity and constant evolution as people try to patch up its many flaws. Rather, it’s a more general sense of unease with the design of such formats. One thing I’ll point to is that both casatables and HDF5 have acquired pluggable I/O backends, which essentially formalize the requirement that the only way to reliably decode datasets is through the official C++ libraries. Casatables has extensible “storage managers” like Dysco, while HDF5 has things like virtual file layers and virtual object layer connectors.

Compounding the complexity in CASA is that the casatables container format implementation is embedded within the rest of the CASA C++ library ecosystem, which makes things even more baroque. Even the “streamlined” casacore codebase consists of, by my quick estimate, around 2,300 source files, with dependencies on a number of external libraries like wcslib, fftw3, HDF5, and ncurses (!). If you just want to understand what’s in a casatable file tree, you need to build this whole suite of libraries, although to be fair you only need to link with a subset of them. But still, this is ridiculously onerous for what should be a low-level operation: understand the contents of this container.

If you use CASA, your analysis sessions are creating casatables data all over the place: MeasurementSets, calibration files, images, source lists, and probably other kinds of data as well. With the standard CASA software, if you want to interact with these files, you need to use these unwieldy C++ libraries, either directly or through wrappers that rely on them, like the casacore Python bindings. I can say from experience that doing so is pretty unpleasant, which in turn limits the development of a software ecosystem around these kinds of data. I’m absolutely certain this is why we see the various efforts to express MeasurementSet data in other serialization formats like Arrow that I mentioned above.

This is, finally, where rubbl_casatables comes in. This Rust crate provides support for using the casatables container format in a self-contained library that provides a hopefully-clean API.

It accomplishes this through brute force: it bundles the subset of the casacore C++ code needed to work with casatables data, and nothing more. There are a small number of modifications to make the codebase more “standalone”, but in the end only a few are needed. Compared to stock casacore, there are “only” 783 C++ source files to compile, and no external dependencies.

The design of the Rust packaging ecosystem plays a major role here. While Rust crates are basically reusable libraries in the C/C++ tradition, you don’t compile them into shared libraries and install them into /usr/lib64. Rust, like Go, has a static-first model. These languages strongly encourage you to only produce binary executables, not shared libraries. If your executable depends on another package, you compile it directly into the executable, rather than installing it as a separate shared library that your executable then depends on. This makes for some relatively large binaries (you’re including all of the code that would be separated out into a library) and long compile times (you’re compiling all of that code from scratch, as well), but in my experience it’s absolutely the right paradigm. We’ve come a long way from DLL Hell, but every shared library dependency still adds a number of failure modes to a software deployment.

My main use of rubbl_casatables is in a companion project called rubbl-rxpackage, which implements a few low-level data-processing tasks that are not available in mainline CASA. The most interesting one is the key data transformation needed to support the peeling workflow that I’ve developed. Another good example is a utility called spwglue that merges together adjacent spectral windows in a MeasurementSet. My first implementation of spwglue was in Python using the casacore Python bindings; porting to Rust sped it up by a factor of twenty.

More broadly, these kinds of programs could be implemented in C++, but doing so would be much, much less pleasant. Besides the heaviness of the casacore library dependency, I cannot emphasize enough how much I dislike working in C++ (see timely toot). C++ was a step forward in its time, but “its time” was thirty years ago. We’ve learned so much about designing programming languages since then, both in terms of ergonomics and safety. Obviously there are immense amounts of legacy code and expertise that we can’t and shouldn’t just throw away, but there are huge benefits when you can move to better tools. Rust makes me actually enjoy writing systems-level code, while (and in no small part because) it simultaneously gives me much more confidence in that code’s correctness.

As I mentioned at the top, the new release of rubbl_casatables primarily upgrades the backing C++ code from version 3.1.1 of casacore to version 3.5.0. To the best of my knowledge, this shouldn’t affect much in the way of major functionality, but it does include a bunch of that necessary C++ maintenance that I mentioned above. @d3v-null undertook this project to get the codebase building on additional CPU targets, and put in a lot of effort to untangle the casacore changes necessary to perform the update — see issue #345 for all of the gory details. It took me a very long time to follow up on all their work, which is something I hope to do better about going forward.

The only really tricky thing I did was address a longstanding weakness dealing with some uses of uninitialized memory in the Rust side of the casatables implementation. We had some code that would allocate uninitialized array buffers and then pass them off to the casacore C++ code to fill with data. This is the kind of thing that, coming from a C/C++ background, is second nature, but it turns out that you actually have to do this with a lot more care than you might think. Rust RFC 2930 has a good discussion of the issues at play; see also this blog post by Ralf Jung. This is the kind of thing that people involved in Rust are super fastidious about getting right, which makes me really thankful; meanwhile, there’s so much legacy C++ out in the world that you can, at best, only hope that a nontrivial project does the right thing consistently.
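For a flavor of what “more care” looks like, here is a minimal sketch of one careful way to let another routine fill a freshly allocated buffer. This is not the actual rubbl_casatables code: `mock_fill` is a hypothetical stand-in for the foreign function, and `read_values` is an invented name. The idea is that the unfilled memory is only ever exposed as `MaybeUninit<f64>` slots, and the vector’s length is set only after the data have been written.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for a foreign routine that fills every slot.
/// A real FFI call would instead receive a raw pointer to the slots
/// (cast to `*mut f64`) together with a length.
fn mock_fill(dst: &mut [MaybeUninit<f64>]) {
    for (i, slot) in dst.iter_mut().enumerate() {
        slot.write(i as f64 * 0.5);
    }
}

/// Allocate room for `len` values, let the callee initialize them, and only
/// then tell the Vec that the elements exist.
fn read_values(len: usize) -> Vec<f64> {
    let mut buf: Vec<f64> = Vec::with_capacity(len);
    // The spare capacity is exposed as MaybeUninit slots, so no `&mut [f64]`
    // ever points at uninitialized memory.
    mock_fill(&mut buf.spare_capacity_mut()[..len]);
    // SAFETY: mock_fill wrote all of the first `len` slots.
    unsafe { buf.set_len(len) };
    buf
}
```

The tempting shortcut is to call `set_len` first and hand out `&mut buf[..]` for the callee to overwrite, which is the C-style habit that the discussions linked above caution against.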

You might wonder about this “Rubbl” name. Rubbl is an umbrella collection of my foundational Rust projects relating to astrophysics (“Rust + Hubble”), mostly data formats: casatables, FITS, and MIRIAD. I hope that it at least has the possibility of one day growing into something truly foundational like Astropy, but right now the casatables implementation is basically the only thing that gets regular usage. My plan is that if I find myself needing to write data analysis code that would be in C or C++, I’ll try to do it in Rust and extend Rubbl as needed. So far, that need has only come up rarely, although the rubbl-rxpackage tools are great examples of times that it has. It’s been very pleasing that the existing code has at least been useful enough for people like @d3v-null to get involved.

The DOI of this new release is 10.5281/zenodo.13315460 (automatically registered in Zenodo with Cranko). The API docs for rubbl_casatables may be found on docs.rs.
