rubbl_casatables 0.8 Wed, 14 Aug 2024 11:50:46 -0400 https://newton.cx/~peter/2024/rubbl-casatables-0-8/ https://newton.cx/~peter/2024/rubbl-casatables-0-8/ <p>Yesterday I put out the <a href="https://github.com/pkgw/rubbl/releases/tag/rubbl_casatables%400.8.0">first release in the 0.8.x series</a> of the <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a> <a href="https://rust-lang.org/">Rust</a> crate, which provides access to the “casatable” data container file format used by the <a href="https://casa.nrao.edu/">CASA</a> radio interferometry package (sometimes called the <a href="https://casacore.github.io/casacore-notes/255.html">CASA Table Data System</a>). CASA uses the casatable format for virtually all of its data files, most notably the <a href="https://casadocs.readthedocs.io/en/stable/notebooks/casa-fundamentals.html#MeasurementSet-Basics">MeasurementSets</a> that store interferometric visibilities. This release is primarily the work of <a href="https://github.com/d3v-null/">@d3v-null</a> at <a href="https://astronomy.curtin.edu.au/">Curtin</a>, who undertook the difficult and tedious project of updating the backing codebase to use the <a href="https://github.com/casacore/casacore/releases/tag/v3.5.0">casacore 3.5.0</a> implementation of the casatables format.</p> <span id="continue-reading"></span> <p>To understand the niche occupied by <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a>, it’s helpful to be careful about how we discuss CASA’s data files. In particular, the distinction between a MeasurementSet and a casatable is important.</p> <p>When I discuss the casatable format, I’m referring to what I call a <a href="https://en.wikipedia.org/wiki/Container_format">container format</a> or sometimes “serialization format”. It defines <em>how</em> to store various kinds of data into files on disk, but is silent on the <em>why</em>: what the data actually mean. File archive formats like <a href="https://en.wikipedia.org/wiki/ZIP_(file_format)">Zip</a> are perhaps the most obvious examples of container formats: the Zip specification tells you how to pack various files into a Zip archive, but it doesn’t (and can’t, and shouldn’t) make any claims about the meaning of those files. Other file formats, like Java’s <a href="https://en.wikipedia.org/wiki/JAR_(file_format)">JAR</a> files, actually use the Zip container format for their underlying storage, then apply additional layers of semantics. For instance, a JAR file should contain a Zip entry called <code>META-INF/MANIFEST.MF</code> whose contents and meaning are defined by the JAR specification.</p> <p><a href="https://casadocs.readthedocs.io/en/stable/notebooks/casa-fundamentals.html#MeasurementSet-v2">The MeasurementSet specification</a>, to be contrasted with casatables, is closer to the JAR format in that it works at a more semantic level, defining a way to represent interferometry data in particular. 
This representation can then be captured in a casatable, but it can also potentially be mapped to other serialization formats like <a href="https://arrow.apache.org/">Arrow</a>, <a href="https://zarr.dev/">Zarr</a>, or <a href="https://parquet.apache.org/">Parquet</a> (see, e.g., the <a href="https://github.com/ratt-ru/arcae/">arcae</a> Python package).</p> <p>In astronomy, we often talk about <a href="https://en.wikipedia.org/wiki/FITS">FITS</a> as an image format, and indeed the “I” in “FITS” does stand for “image”, but modern FITS is really a container format as well. After all, FITS files can contain multiple <a href="https://docs.astropy.org/en/stable/io/fits/api/hdus.html">Header Data Units</a>, each of which can contain an N-dimensional array, or a binary table, or <a href="https://fits.gsfc.nasa.gov/registry/fitsidi.html">interferometry data</a>, or <a href="https://fits.gsfc.nasa.gov/fits_registry.html">all sorts of other custom data structures</a>. I would claim that the longevity of FITS stems in large part from its evolution from a pure image format to a more flexible container format with a very simple specification, which makes it easy to implement FITS readers and writers in a variety of languages. For instance, I also wrote a small <a href="https://docs.rs/rubbl_fits/"><code>rubbl_fits</code></a> crate that exposes FITS I/O to Rust. It’s beta-quality at best, but I’m still pretty confident that it can correctly understand the structure of any valid FITS file that you throw at it.</p> <p>The casatables format is another member of the astronomical data container family. In my view it’s most similar to <a href="https://www.hdfgroup.org/">HDF5</a>: they’re both relatively “modern” formats for complex scientific data, designed as extensible container formats from the start.</p> <p>The other thing that these two formats have in common — which honestly infuriates me in both cases — is that their byte-level serializations are extremely complex and at best weakly documented; effectively the <em>only</em> way to use these container formats is through extremely gnarly C++ libraries provided by the format developers. HDF5 does at least have a <a href="https://docs.hdfgroup.org/hdf5/v1_14/_f_m_t3.html">written specification of the on-disk format</a>, while to the best of my knowledge there isn’t one at all for casatables. But in either case, if you wanted to implement a parser for the format from scratch in another programming language, I strongly suspect that you’d basically have to reverse-engineer the C++ codebase.</p> <p>If you ask me, the cardinal virtue of a container format is to have a minimally-complex, clear specification that you can imagine implementing from scratch if needed. Once again, I think this is why FITS has lasted for so long. It’s not that I’m worried about literally losing the ability to compile the relevant implementation libraries, although C++ code in particular seems to need regular maintenance just to stay buildable: the language has an unfortunate combination of high complexity and constant evolution as people try to patch up its many flaws. It’s a more general sense of unease with the design of such formats. One thing I’ll point to is that both casatables and HDF5 have acquired pluggable I/O backends, which essentially formalize the requirement that the only way to reliably decode datasets is through the official C++ libraries. 
Casatables has extensible “storage managers” like <a href="https://github.com/aroffringa/dysco">Dysco</a>, while HDF5 has things like <a href="https://docs.hdfgroup.org/hdf5/v1_12/_v_f_l.html">virtual file layers</a> and <a href="https://docs.hdfgroup.org/hdf5/v1_14/_h5_v_l__u_g.html">virtual object layer connectors</a>.</p> <p>Compounding the complexity in CASA is that the casatables container format implementation is embedded within the rest of the CASA C++ library ecosystem, which makes things even more baroque. Even the “streamlined” <a href="https://github.com/casacore/casacore/">casacore</a> codebase consists of, by my quick estimate, around 2,300 source files, with dependencies on a number of external libraries like <a href="https://www.atnf.csiro.au/people/mcalabre/WCS/index.html">wcslib</a>, <a href="https://www.fftw.org/">fftw3</a>, <a href="https://www.hdfgroup.org/">HDF5</a>, and <a href="https://invisible-island.net/ncurses/">ncurses</a> (!). If you just want to understand what’s in a casatable file tree, you need to build this whole suite of libraries, although to be fair you only need to link with a subset of them. But still, this is ridiculously onerous for what should be a low-level operation: <em>understand the contents of this container</em>.</p> <p>If you use CASA, your analysis sessions are creating casatables data all over the place: MeasurementSets, calibration files, images, source lists, and probably other kinds of data as well. With the standard CASA software, if you want to interact with these files, you need to use these unwieldy C++ libraries — either directly, or using wrappers that rely on them, like the <a href="https://github.com/casacore/python-casacore">casacore Python bindings</a>. I can say from experience that doing so is pretty unpleasant, which in turn limits the development of a software ecosystem around these kinds of data. I’m absolutely certain this is why we see the various efforts to express MeasurementSet data to other serialization formats like <a href="https://arrow.apache.org/">Arrow</a> that I mentioned above.</p> <p>This is, finally, where <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a> comes in. This <a href="https://crates.io/">Rust crate</a> provides support for using the casatables container format in a self-contained library that provides a hopefully-clean API.</p> <p>It accomplishes this through brute force: it bundles the subset of the <a href="https://github.com/casacore/casacore/">casacore</a> C++ code needed to work with casatables data, and nothing more. There are a small number of modifications to make the codebase more “standalone”, but in the end only a few are needed. Compared to stock casacore, there are “only” 783 C++ source files to compile, and no external dependencies.</p> <p>The design of the Rust packaging ecosystem plays a major role here. While Rust crates are basically reusable libraries in the C/C++ tradition, you don’t compile them into shared libraries and install them into <code>/usr/lib64</code>. Rust, like <a href="https://go.dev/">Go</a>, has a static-first model. These languages strongly encourage you to only produce binary executables, not shared libraries. If your executable depends on another package, you compile it directly into the executable, rather than installing it as a separate shared library that your executable then depends on. 
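</p>
<p>To make that concrete: pulling <code>rubbl_casatables</code> into a program is just another Cargo dependency, with the bundled casacore C++ compiled straight into your executable, and basic read access looks something like the sketch below. (I’m writing the method names from memory, so treat this as illustrative rather than definitive and check the <a href="https://docs.rs/rubbl_casatables/latest/rubbl_casatables/">API documentation</a> for the real signatures.)</p>
<pre><code>// Sketch only: names recalled from memory; consult docs.rs before relying
// on exact signatures.
use rubbl_casatables::{Table, TableOpenMode};

fn main() {
    // A casatable "file" is really a directory tree on disk.
    let mut table = Table::open("data.ms", TableOpenMode::Read)
        .expect("failed to open the table");

    println!("number of rows: {}", table.n_rows());

    // TIME is a standard float64 column in a MeasurementSet's main table.
    let times: Vec&lt;f64&gt; = table
        .get_col_as_vec("TIME")
        .expect("failed to read the TIME column");

    println!("first timestamp: {:?}", times.first());
}
</code></pre>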
<p>This static-first model makes for some relatively large binaries (you're including all of the code that would otherwise be separated out into a shared library) and long compile times (you're compiling all of that code from scratch, as well), but in my experience it’s absolutely the right paradigm. We’ve come a long way from <a href="https://en.wikipedia.org/wiki/DLL_Hell">DLL Hell</a>, but every shared library dependency still adds a number of failure modes to a software deployment.</p> <p>My main use of <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a> is in a companion project called <a href="https://github.com/pkgw/rubbl-rxpackage"><code>rubbl-rxpackage</code></a>, which implements a few low-level data-processing tasks that are not available in mainline CASA. The most interesting one is <a href="https://github.com/pkgw/rubbl-rxpackage/blob/master/src/peel.rs">the key data transformation</a> needed to support <a href="https://newton.cx/~peter/2024/peeling-tool/">the peeling workflow that I’ve developed</a>. Another good example is a utility called <a href="https://github.com/pkgw/rubbl-rxpackage/blob/master/src/spwglue.rs"><code>spwglue</code></a> that merges together adjacent spectral windows in a MeasurementSet. My first implementation of <code>spwglue</code> was in Python using the casacore Python bindings; porting to Rust sped it up by a factor of <strong>twenty</strong>.</p> <p>More broadly, these kinds of programs <em>could</em> be implemented in C++, but doing so is much, much less pleasant. Besides the heaviness of the casacore library dependency, I cannot emphasize enough how much I dislike working in C++ (see <a href="https://mastodon.world/@vitaut@mastodon.social/112957324812448747">this timely toot</a>). C++ was a step forward in its time, but “its time” was thirty years ago. We’ve learned <em>so much</em> about designing programming languages since then, both in terms of ergonomics and safety. Obviously there are immense amounts of legacy code and expertise that we can’t and shouldn’t just throw away, but there are huge benefits when you can move to better tools. Rust makes me actually enjoy writing systems-level code, while (and in no small part because) it simultaneously gives me much more confidence in that code’s correctness.</p> <p>As I mentioned at the top, the <a href="https://github.com/pkgw/rubbl/releases/tag/rubbl_casatables%400.8.0">new release of <code>rubbl_casatables</code></a> primarily upgrades the backing C++ code from version 3.1.1 of casacore to version 3.5.0. To the best of my knowledge, this shouldn’t affect much in the way of major functionality, but it does include a bunch of that necessary C++ maintenance that I mentioned above. <a href="https://github.com/d3v-null/">@d3v-null</a> undertook this project to get the codebase building on additional CPU targets, and put in a lot of effort to untangle the casacore changes necessary to perform the update — see issue <a href="https://github.com/pkgw/rubbl/issues/345">#345</a> for all of the gory details. It took me a very long time to follow up on all their work, which is something I hope to do better about going forward.</p> <p>The only really tricky thing I did was <a href="https://github.com/pkgw/rubbl/pull/393">address a longstanding weakness</a> dealing with some uses of uninitialized memory in the Rust side of the casatables implementation. We had some code that would allocate uninitialized array buffers and then pass them off to the casacore C++ code to fill with data. 
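</p>
<p>To give a flavor of the problem, here is a simplified sketch; it is not the actual <code>rubbl</code> code, and <code>cxx_fill</code> stands in for the real casacore binding:</p>
<pre><code>use std::mem::MaybeUninit;

// `cxx_fill` is a stand-in for the real casacore binding: it promises to
// write all `n` elements through the pointer it is given.
fn read_column(n: usize, cxx_fill: unsafe extern "C" fn(*mut f64, usize)) -> Vec&lt;f64&gt; {
    // The tempting C-style version is formally undefined behavior in Rust,
    // even though it often appears to work:
    //
    //     let mut buf = Vec::with_capacity(n);
    //     unsafe { buf.set_len(n) };        // asserts initialized data that isn't there
    //     unsafe { cxx_fill(buf.as_mut_ptr(), n) };
    //
    // The careful version routes through MaybeUninit and only reinterprets
    // the memory as plain f64s after the C++ side has actually written it:
    let mut buf: Vec&lt;MaybeUninit&lt;f64&gt;&gt; = Vec::with_capacity(n);

    unsafe {
        buf.set_len(n); // fine: a MaybeUninit&lt;f64&gt; is allowed to be uninitialized
        cxx_fill(buf.as_mut_ptr() as *mut f64, n);

        // Reassemble the now-initialized buffer as a Vec of plain f64s.
        let ptr = buf.as_mut_ptr() as *mut f64;
        let (len, cap) = (buf.len(), buf.capacity());
        std::mem::forget(buf);
        Vec::from_raw_parts(ptr, len, cap)
    }
}
</code></pre>
<p>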
This is the kind of thing that, coming from a C/C++ background, is second nature, but it turns out that you actually have to do this with a lot more care than you might think. <a href="https://github.com/rust-lang/rfcs/blob/master/text/2930-read-buf.md#background">Rust RFC 2930</a> has a good discussion of the issues at play; see also <a href="https://www.ralfj.de/blog/2019/07/14/uninit.html">this blog by Ralf Jung</a>. This is the kind of thing that makes me really thankful that people involved in Rust are super fastidious about getting right, while there’s so much legacy C++ out in the world that you can, at best, only hope that a nontrivial project does the right thing consistently.</p> <p>You might wonder about this “Rubbl” name. Rubbl is an umbrella collection of my foundational Rust projects relating to astrophysics (“Rust + Hubble”), mostly data formats: casatables, FITS, and <a href="https://github.com/astroumd/miriad">MIRIAD</a>. I hope that it at least has the possibility of one day growing into something truly foundational like <a href="https://www.astropy.org/">Astropy</a>, but right now the casatables implementation is basically the only thing that gets regular usage. My plan is that if I find myself needing to write data analysis code that would be in C or C++, I’ll try to do it in Rust and extend Rubbl as needed. So far, that need has only come up rarely, although the <a href="https://github.com/pkgw/rubbl-rxpackage"><code>rubbl-rxpackage</code></a> tools are great examples of times that it has. It’s been very pleasing that the existing code has at least been useful enough for people like <a href="https://github.com/d3v-null/">@d3v-null</a> to get involved.</p> <p>The DOI of this new release is <a href="https://doi.org/10.5281/zenodo.13315460">10.5281/zenodo.13315460</a> (automatically registered in Zenodo with <a href="https://pkgw.github.io/cranko/">Cranko</a>). The API docs for <code>rubbl_casatables</code> may be found <a href="https://docs.rs/rubbl_casatables/latest/rubbl_casatables/">on docs.rs</a>.</p> Addendum: Seven-Figure Scientific Software Projects Thu, 08 Aug 2024 12:45:41 -0400 https://newton.cx/~peter/2024/seven-figure-addendum/ https://newton.cx/~peter/2024/seven-figure-addendum/ <p>There’s at least one important point that I forgot to make in <a href="https://newton.cx/~peter/2024/seven-figure-software/">yesterday's post</a>.</p> <span id="continue-reading"></span> <p>I wrote that while software projects are in many ways like hardware projects, one noteworthy difference is that they’re basically made <em>entirely</em> out of people. While an established hardware development program is supported by durable physical assets like lab equipment, an established software development program is, to a very good approximation, completely intangible. This makes it extra-important to provide stable, sustainable support for the people involved. Of course, this is not to imply that such stability isn’t important for hardware programs too, or that people who build hardware are interchangeable.</p> <p>What I forgot to point out is that this is also the case for <em>core astronomical research itself!</em> Most established research programs in exoplanets, gamma-ray bursts, or whatever, are likewise made entirely out of people. 
I’m quite sure that folks with more equipment-intensive programs, like lab astrophysicists, would agree that what they’ve built is mostly about human expertise, rather than the hardware.</p> <p>And universities have a time-honored mechanism for recruiting and retaining the experts around whom these research programs are centered. <strong>You hire them and you give them tenure.</strong> It’s a decently successful approach.</p> <p>This is really the thing that’s been bothering me. It is indeed very difficult to recruit and retain the skilled professionals needed to build high-quality software in a university context. But we already have the systems in place to do this! We just need to collectively decide that software leads deserve tenure-track slots.</p> <p>Now it remains true that building astronomical software, like building hardware, is not quite the same thing as doing astronomical research itself. The act of writing software can (ideally) create tools that lead to new astronomical knowledge; and, something that’s pretty underrated, it can help capture and make explicit existing or previously-vague astronomical knowledge. But it doesn’t create new astronomical knowledge <em>per se</em>. Furthermore, I’m particularly interested in infrastructural software, which you could view as being an extra step removed from knowledge creation: tools that help you create more tools.</p> <p>And my impression — maybe it’s wrong — is that at the moment, to get hired onto the tenure track as a tool-builder of either the software or hardware variety, you need to have a story about how you’re going to bring in a lot of money to support your activities. A plan to operate in artisan craftsmanship mode, just you and an apprentice in the basement, won’t get you hired. Hence my focus on grantmaking in <a href="https://newton.cx/~peter/2024/seven-figure-software/">the previous post</a>.</p> <p>The other analogy to mention is that the commonplace uncertainty and unpredictability of the software design process is <em>also</em> very reminiscent of the uncertainty and unpredictability of scientific research itself. You try stuff; you realize that your first plan wasn’t working; you pivot; something surprising catches on with your colleagues. So in this way too, software development should be a very comfortable fit for academia. Innovative hardware development has the same features too, but for understandable reasons the <em>speed</em> at which things in software can evolve feels a lot closer to the speed at which the science itself can move.</p> <p>All of this makes me optimistic that there’s a way to find a home for innovative scientific software development — and developers — in the university system. But it would certainly be a lot easier to do so if we had more well-established avenues for PIs to get funding for substantial software projects.</p> Seven-Figure Scientific Software Projects Wed, 07 Aug 2024 13:20:00 -0400 https://newton.cx/~peter/2024/seven-figure-software/ https://newton.cx/~peter/2024/seven-figure-software/ <p>“Get this — I just got a six-million dollar grant to develop a new astronomical image viewer!” That’s not something that’s ever been said, I’ll wager. But why not?</p> <span id="continue-reading"></span> <p>I started thinking about this in the context of a slogan that I’ve been toying with for a little while: “Every astronomy department should have a tenured software specialist”. Bold (and self-serving), I know. 
But it’s almost defensible, I believe, if we think optimistically based on the analogy to hardware instrumentation development. In both cases we’re talking about people who build tools. We have well-known problems giving these people enough credit, but I’d like to think that astronomers generally appreciate that our field moves forward on the basis of their work. Having a tool-builder in-house gives your faculty a leg up on the competition. Developing new tools tends to be expensive, and to require specialized skills … but by the same token, good tool-builders should be able to bring in a lot of overhead!</p> <p>And this is true of hardware development. Compared to your baseline generic <a href="https://new.nsf.gov/funding/opportunities/astronomy-astrophysics-research-grants-aag">NSF AAG</a> research grant of around $500k, hardware projects can access bigger pots of money. To pick a few awardees from the older <a href="https://new.nsf.gov/funding/opportunities/mid-scale-innovations-program-astronomical">NSF MSIP</a> program, you can get $2.5 million to <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2216481">build an exoplanet imaging spectrograph for Keck</a> (UC Irvine), or around $7 million for <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1836002">an integral field spectrograph for Magellan</a> (MIT). You can get a lot more, of course, as you scale up from individual instruments to whole facilities.</p> <p>If I’m applying for tenure-track positions as a person who builds software (I’m not — that ship has sailed), I want to be able to tell a story about how I’ll secure grants in that seven-figure-and-beyond range. Even if we ignore the very real fact that people do care how much overhead you bring in, this is simply the scale of funding that you need if you want to start something important that has the chance to make a lasting difference in the field. (Something like <a href="https://emcee.readthedocs.io/">emcee</a> might be an exception, but I also bet that emcee would have a lot more impact if a few million dollars were spent on it!) For reference, typical seed funding rounds for Silicon Valley startups <a href="https://carta.com/learn/startups/fundraising/seed-funding/">look like they’re $3 million</a>, and Series A rounds are larger by a factor of a few.</p> <p>Compared to the hardware domain, though, it’s a <em>lot</em> harder to tell that story. Your intuition probably screams out that you’d have zero chance of getting the NSF to hand you seven-figure sums on the basis of “I have an idea for the next <a href="https://astropy.org/">Astropy</a>” or even much more specific, but still ambitious projects like “I want to build a new VLBI data reduction package”. 
<a href="https://new.nsf.gov/funding/opportunities/advanced-technologies-instrumentation-astronomical">NSF ATI</a> (grants going up to ~$2M, total pool this year of ~$8M) nominally supports software development but the framing of the program (“enable observations for ground-based astronomy that are difficult or impossible to obtain with existing means”) makes that a virtual non-starter, and I don't see any pure-software projects in <a href="https://www.nsf.gov/awardsearch/advancedSearchResult?ProgEleCode=121800&amp;BooleanElement=Any&amp;BooleanRef=Any&amp;ActiveAwards=true#results">the recently awarded ATI projects</a>.</p> <p>Now, there is <a href="https://new.nsf.gov/funding/opportunities/cyberinfrastructure-sustained-scientific">CSSI</a> out of the NSF <a href="https://new.nsf.gov/cise">CISE</a> directorate: “Cyberinfrastructure for Sustained Scientific Innovation”. This is probably the closest in spirit to the kind of funding that I’d like to see, and the program scale is in the right ballpark. CSSI “Framework Implementation” awards come in around $2 million. But there are planned to be around ten of these given out in the 2024 round, across pretty much the whole NSF; framework implementations are “aimed at solving common research problems faced by NSF researchers in one or more areas of science and engineering”. This is all well and good, but think of the hardware analogy: would that Keck imaging spectrograph fit that definition? <a href="https://worldwidetelescope.org/">WorldWide Telescope</a> got <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2004840">a smaller CSSI Elements grant</a>, and I would love to go for a Framework Implementation, but it would be a difficult sell.</p> <p>In the current environment, if you want access to substantial resources for software development, you can tune your CSSI pitch, and you can try to piggyback on tangible facilities: maybe you can secure a big subaward to develop something like a pipeline for a major observatory. That’s simply where the significant pots of money for astronomical software development can actually be found — attached to very large projects like <a href="https://www.lsst.org/">Rubin</a> and space missions.</p> <p>These projects are only going to support certain kinds of software development, though. Not to undersell the importance of pipelines and other facility-type software, but when I think of software efforts that ambitious “software instrumentalists” would want to be able to point to as significant professional accomplishments, I think of things like <a href="https://astropy.org/">Astropy</a>, <a href="https://jupyter.org/">Jupyter</a>, <a href="https://docs.mesastar.org/">MESA</a>, or <a href="https://sites.google.com/cfa.harvard.edu/saoimageds9">ds9</a>, the project behind of this year’s <a href="https://www.adass.org/softwareprize.html">ADASS Software Prize</a>. These are also the kind of project that we need much more of, I think. Historically, people have found ways to support work on these sorts of foundational systems through facilities funding (ds9 probably being the best example), but as funding gets tighter, software gets more expensive, and people appreciate more and more just how difficult software projects are, this approach seems less and less viable to me.</p> <p>There’s a much bigger problem here than simply the lack of an appropriately-targeted funding program, though. 
As almost everyone has come to recognize by now, most software projects are <strong>fundamentally</strong> different undertakings than hardware projects, in ways that have significant implications for how they need to be supported and managed. This is despite the fact that in other ways, software and hardware projects indeed have much in common.</p> <p>Consider some of the MSIP examples above. <em>Some</em> of the key aspects of the deliverables are extremely concrete: I will build a spectrometer with such-and-such resolving power, operating in such-and-such waveband, attaching to the back of such-and-such telescope. It’s possible to specify software deliverables in the same way: Astropy will allow users to load FITS files; <a href="https://casa.nrao.edu/">CASA</a> will allow users to calibrate VLA data. You <em>can</em> build software this way, and sometimes you have to; but even the most straitlaced engineering organizations now understand that software-by-specification is at best a deeply limited approach. Say what you will about <a href="https://agilemanifesto.org/">agile</a>, <a href="https://www.scrum.org/">scrum</a>, and the rest, but these methods were invented because traditional ones were utterly failing in the software context.</p> <p>Many thousands of words have probably been written about “why software is different.” To a certain extent, the specific reasons probably aren’t even that important. But as someone who cares a lot about the quality of software, in the gestalt <a href="https://en.wikipedia.org/wiki/Pirsig%27s_Metaphysics_of_Quality">Zen-and-the-Art-of-Motorcycle-Maintenance sense</a>, I can say that I find that the things that make certain pieces of software the <em>most</em> exciting and inspiring are the ones that are farthest from what would be captured in a typical specification. <a href="https://git-scm.com/">Git</a> versus <a href="https://subversion.apache.org/">Subversion</a>, <a href="https://ninja-build.org/">Ninja</a> versus <a href="https://en.wikipedia.org/wiki/Make_(software)">make</a>, <a href="https://beancount.github.io/">Beancount</a> versus <a href="https://hledger.org/">hledger</a>, <a href="https://rust-lang.org/">Rust</a> versus <a href="https://en.wikipedia.org/wiki/C%2B%2B">C++</a>: each pair of tools would likely satisfy the same written spec, but you’ll never convince me that they’re of equal quality.</p> <p>Anyway, all that is to say that in my view, the reason that the NSF doesn’t have a great way to give you $5 million to build the next Astropy is that everyone involved recognizes that doing so would rarely yield good results within the current framework. You could have very little confidence up-front about what was going to come out of the whole effort, and it would be really easy to spend all that money and end up with something that no one actually wanted to use. The early-2000s <a href="https://ui.adsabs.harvard.edu/abs/2001ASPC..238....3S/abstract">US NVO</a> experience isn’t exactly inconsistent with all this. I’ve been harping on the NSF here for specificity, but any traditional grantmaker is going to face the same issues.</p> <p>It’s true that projects that have <em>already</em> achieved a high level of significance can attract big grants: Jupyter landed <a href="https://blog.jupyter.org/new-funding-for-jupyter-12009a836867">$6 million in 2015</a>; Astropy broke through with <a href="https://www.moore.org/grant-detail?grantId=GBMF8435">$900k from Moore in 2019</a>. 
But unfortunately, it’s really, really hard to build up a compelling software project on a series of small grants. My understanding is that <a href="https://www.stsci.edu/">STScI</a> made a long-term investment on the order of tens of millions to get Astropy going, and ds9 has benefited from long-term, steady funding via <a href="https://chandra.harvard.edu/">Chandra</a> — funding that’s now in extreme danger thanks to <a href="https://cxc.cfa.harvard.edu/cdo/announcement.html">Chandra’s budget being blown up</a>. <a href="https://www.plasmapy.org/">PlasmaPy</a> did <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1931388">get $1.4 million</a> relatively early in the project history, but they likely benefited from having an extremely legible pitch: “let’s make an Astropy for plasma physics”.</p> <p>I’m sure that hardware development has comparable bootstrapping problems, but it seems to me that the challenges for software are going to be worse. If you’re starting out a new hardware development program, you might convince the NSF or your institution to invest in lab space, a vector network analyzer, a mass spectrometer, or whatever. If it all goes belly-up, you still have your capital investment. Software projects, on the other hand, are all <a href="https://www.investopedia.com/ask/answers/112814/whats-difference-between-capital-expenditures-capex-and-operational-expenditures-opex.asp">opex</a> to a good approximation — people. If a project fails, you’ll have essentially <em>nothing</em> to show for it. What’s worse, talented people care about things like “whether they will have a job in a year“ or “what their long-term career prospects look like,” whereas vector network analyzers emphatically don’t. I believe strongly that if you want to recruit and retain good software developers, you’ll have to be able to offer them a level of stability and career growth potential that is <em>extremely</em> foreign to university standards. And you’re not going to get good software without good developers.</p> <p>So, how do we make it possible for someone to establish themselves as a “software instrumentalist”? It goes without saying that more funding wouldn’t hurt, but the key point is that if we want to enable significant, innovative, PI-driven scientific software projects — and I think we do — we need different <em>kinds</em> of funding. The software projects that I think, frankly, are the most interesting and valuable entail a kind of uncertainty that does not match well to traditional grant-proposal models, and the challenge is made only more difficult because building a sustainable software-production practice requires stable, substantial investment.</p> <p>Undoubtedly people have ideas about ways that grantmakers could do things differently to better support innovative software development. The obvious source of inspiration would be Silicon Valley: call it the “startup” model. I think the key through-line would be that the funder would have to think of itself as investing in a team of people, rather than a particular product. Startups pivot all the time, after all. Maybe your initial product idea wasn’t any good, but if you can show that you’ve become skilled at figuring out how to build something interesting that people will actually use, that’s a success. 
I’m not finding anything to link to at the moment, but I’m sure this sort of idea is well-trodden ground.</p> <p>What’s interesting is that I see elements of this approach in the design of the <a href="https://science.nasa.gov/learn/about-science-activation/">NASA Science Activation</a> program, NASA’s umbrella funding vehicle for science education projects. Grants are relatively large and long-lasting; oversight is relatively hands-on, with regular meetings and each project having to retain an external evaluator; and there’s a big emphasis on inter-project collaboration and the development of an overall education-focused community of practice. If I had a really big pot of money to support innovative PI-driven software projects, those would all be things that I’d want to have as well.</p> <p>You could also say that the upshot of all of this is that if you want to produce innovative scientific software, stay out of the universities. Go get a job at <a href="https://www.stsci.edu/">STScI</a> and convince a higher-up to peel off some money to support your vision. It’s not terrible advice, but I’d really like to think that we can do better. I think there are a ton of PI-driven software projects that could be executed for an amount of money that’s totally in line with hardware development efforts, and would deliver comparable if not much more impact for the expense — think Astropy and <a href="https://ui.adsabs.harvard.edu/">ADS</a>. The benefits might be extremely diffuse, but that’s exactly the kind of thing that grantmakers are supposed to figure out how to support.</p> <p>I don’t have a way to make money magically appear, but if it does, the key is to be able to spend with confidence. That means having better tools to estimate cost and schedule for specific software projects; a clear idea of how we’re going to do oversight; realistic models for developer retention, software adoption, and other social processes; and ultra-clear definitions of success. If we understand and even embrace the distinctive characteristics of software development, and think carefully about how those characteristics interact with our existing institutions, we can tap into an immense amount of potential.</p> <p><em>See also <a href="https://newton.cx/~peter/2024/seven-figure-addendum/">an addendum</a>.</em></p> Mitigating “Source Splitting” in DASCH Wed, 31 Jul 2024 08:35:22 -0400 https://newton.cx/~peter/2024/dasch-source-splitting/ https://newton.cx/~peter/2024/dasch-source-splitting/ <p>Last week I added a new feature to <a href="https://dasch.cfa.harvard.edu/">DASCH</a>’s new data analysis toolkit, <a href="https://dasch.cfa.harvard.edu/drnext/"><em>daschlab</em></a>. There is now code that aims to help mitigate the <a href="https://dasch.cfa.harvard.edu/drnext/ki/source-splitting/">“source splitting” known issue</a>, in which photometric measurements of a single source end up divided among several DASCH lightcurves. Supporting this new feature are some API changes in <em>daschlab</em> that will affect most existing analysis notebooks, a new tutorial notebook, and associated documentation updates.</p> <span id="continue-reading"></span> <p>“Source splitting” is the name that I’ve given to a phenomenon that occurs in the DASCH data with some regularity. Sometimes, if you pull up the lightcurve for a decently bright star, there will only be a handful of detections, when there should instead be thousands. 
If you dig deeper, what you’ll usually find is that the detections of your source have been erroneously associated with other nearby stars in the reference catalog. The detections are all there, but they’re mislabeled as to which star they belong to.</p> <p>To understand how this happens, it helps to think about how DASCH lightcurves are constructed. The pipeline processing of each plate image is totally standard: extract a source catalog from the image, then derive astrometric and photometric calibrations. Once this is done, the DASCH pipeline “pre-compiles” lightcurve photometry: the image catalog is matched to a reference catalog (“refcat”; <a href="https://www.aavso.org/apass">APASS</a> or <a href="https://archive.stsci.edu/hlsp/atlas-refcat2">ATLAS-REFCAT2</a>), and any matched detections are appended to a giant database of photometry for every refcat source. This scheme is basically how every time-domain lightcurve database works.</p> <p>(At the moment, the photometric calibration and per-image source catalog are then basically thrown away. This means that DASCH cannot support forced photometry at specific locations, among other useful operations. Hopefully we’ll be able to add this capability in the future.)</p> <p>One of the issues with DASCH, though, is that our astrometry is quite uncertain. A major factor here is the fundamental fact that the underlying data are analog, so we have to construct a digital astrometric solution. But also, DASCH plates often came from tiny telescopes, which means that they both cover huge areas of the sky and often have significant distortions away from the optical axis. The pipeline can do an extremely good job of astrometric calibration, but in a collection of more than 400,000 images, there are going to be errors.</p> <p>Personally, I’ve found it hard to get used to the fact that if you have a stack of DASCH cutouts that are nominally centered around some RA/Dec, the coordinates for some images might have <em>significant</em> systematic offsets. If I have a bunch of detections of a source, I’m used to, say, plotting the detection RA/Decs and flagging ones with large offsets as outliers. But in DASCH, those detections might be totally fine — it could be the WCS solution that’s wrong, not the source measurement.</p> <p>Making all of this worse is that the DASCH collection is so heterogeneous. Some plates in the collection have spatial resolutions 25 times better than others. For the highest-resolution plates, you might have a region where in the best cases you can accurately assign photometric measurements to any of, say, a dozen stars. For the lowest-resolution plates, all of those stars might blend together, and an astrometric solution that’s overall quite good might still cause one star to be misidentified as another. You can see how this would lead to source splitting.</p> <p>Finally, there’s a contributing factor that I have to confess that I don’t fully understand. The DASCH pipeline had a lot of infrastructure to search for transients, like flaring X-ray binaries. As best I can tell, the pipeline had some mechanism to identify promising transients and add them to the refcat, presumably with the goal of identifying additional outbursts if there should be any. But this functionality seems to have interacted poorly with the astrometric issues above. 
I believe, but am not entirely sure, that in some cases the pipeline would end up creating new refcat entries for “transients” that were really well-known sources identified with the wrong location due to astrometric calibration errors. All of this is clouded in uncertainty because I’ve chosen to completely ignore the issue of DASCH transients for the time being, so I haven’t spent any time learning about this aspect of the historical pipeline. There’s a ton of scientific potential in this area, but I think that the underlying calibrations and data products need to be improved before transient searches will really become worthwhile.</p> <p>The good news is that when source splitting happens, it’s generally a well-behaved phenomenon. You’ll typically find that the reference catalog contains a number of entries near your target of interest (within, say, 20 or 30 arcsec), and that detections of that target are in effect randomly assigned to one of those entries. So, for any given exposure containing your source, exactly one of those refcat entries has a good detection, and the rest are upper limits. “All” you need to do is merge the detections together and ignore the upper limits.</p> <p>The <a href="https://dasch.cfa.harvard.edu/drnext/"><em>daschlab</em></a> software now supports this, in the function <a href="https://daschlab.readthedocs.io/en/latest/api/daschlab.lightcurves.merge.html"><code>daschlab.lightcurves.merge()</code></a>. Given a set of input lightcurves, it will do precisely what I just wrote, with one minor elaboration. First, the algorithm matches up all of the lightcurves by exposure and groups together entries that are totally unambiguous: cases where there is exactly one detection among all of the lightcurves. It will then use those data to determine a mean source magnitude. Next, it will make a pass over the ambiguous cases: ones where, at a given exposure, multiple lightcurves contain detections. For those, it will choose the lightcurve point whose magnitude is closest to the mean. This is a pretty dumb heuristic, but in my samples there are only a handful of ambiguous points (fewer than ten, in lightcurves with thousands of detections), so in most cases it should be good enough.</p> <p>To demonstrate this code, I created a new <a href="https://dasch.cfa.harvard.edu/drnext/#tutorials">tutorial notebook</a> for the example eclipsing binary <a href="http://simbad.u-strasbg.fr/simbad/sim-id?protocol=html&amp;Ident=hd%205501&amp;NbIdent=1&amp;Radius=2&amp;Radius.unit=arcmin">HD 5501</a>. For whatever reason, the “official” DASCH catalog match for this source only has a few dozen detections; the vast majority of the detections are associated with a source about 10 arcsec away. The notebook demonstrates how to identify this issue, how to choose which lightcurves to merge, and how to actually do the merge. In this case, the number of usable lightcurve points goes from about 2,000 (associated with a single refcat source) to about 2,500 (merged from ten separate lightcurves).</p> <p>Of course, this is one of those situations where software support is nice, but the real thing to do is to <strong>actually fix the problem</strong>. I think that the pipeline’s matching code probably ignores a-priori brightness information inappropriately. If I have a measurement of 10.0 mag, I should match it with a refcat source logged with a mean brightness of 10.1 mag, even if the coordinates of the detection are nominally slightly closer to something with a brightness of 15.0 mag. 
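</p>
<p>As a toy illustration of what I mean, the match score ought to penalize both the positional offset <em>and</em> the disagreement with the catalog magnitude. This sketch (in Rust) is emphatically not the pipeline’s actual code, and the weights are invented:</p>
<pre><code>/// Toy match score combining angular offset and magnitude agreement;
/// smaller is better. The 1-arcsec and 0.5-mag scales are invented for
/// illustration and are not taken from the DASCH pipeline.
fn match_score(det_ra: f64, det_dec: f64, det_mag: f64,
               cat_ra: f64, cat_dec: f64, cat_mean_mag: f64) -> f64 {
    // Small-angle approximation to the angular separation, in arcseconds.
    let dra = (det_ra - cat_ra) * det_dec.to_radians().cos() * 3600.0;
    let ddec = (det_dec - cat_dec) * 3600.0;
    let sep_arcsec = (dra * dra + ddec * ddec).sqrt();

    // With these weights, a 10.0 mag detection prefers a 10.1 mag catalog
    // star over a 15.0 mag one, even if the latter is a bit closer on the sky.
    (sep_arcsec / 1.0).powi(2) + ((det_mag - cat_mean_mag) / 0.5).powi(2)
}
</code></pre>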
<p>And, of course, <a href="https://newton.cx/~peter/2024/dasch-astrometry/">improving the astrometric calibration</a> will help. But I don’t want to get deep into mucking around with the photometry / lightcurve pipeline right now, so it seemed worthwhile to try to provide a mitigation in the meantime.</p> <p>To support the merge functionality, I made an update to <em>daschlab</em> that I’ve been avoiding for a little while. One of the key tables in a <em>daschlab</em> analysis session used to be the “plate list”: a table of information about every plate overlapping the coordinates of interest. But this was actually the wrong table to offer. Why? Because some plates record multiple exposures, and each exposure has its own sky coordinates. Sometimes, these exposures were of totally different parts of the sky: an equatorial survey field and a polar calibration field, for instance. So we should really have a table of <em>exposures</em> that overlap your target. There might be multiple exposures on a single plate, each of which covers your target but maps to a different portion of the plate’s digital image. I realized this a while ago, but put off dealing with it because it would be a pain to make all of the necessary changes. To do the merges properly, though, I needed to get this right, so I’ve torn off the band-aid: the <em>daschlab</em> plate table has become the <a href="https://daschlab.readthedocs.io/en/latest/api/daschlab.exposures.Exposures.html"><code>Exposures</code></a> table, and lots of associated aspects of the API have been reworked. This is a somewhat annoyingly invasive change, but these early days are absolutely the time to get things right. My apologies if this has messed up anyone’s existing notebooks.</p> <p>One additional wrinkle is that exposures come in two kinds. From the written logbooks that were created contemporaneously with the plates, we know what exposures should exist in principle: plate <a href="https://starglass.cfa.harvard.edu/plate/c09375">C09375</a> is logged to have two exposures. But we can also identify exposures from astrometric analysis of plate images: the DASCH pipeline uses the <a href="http://astrometry.net/">Astrometry.Net</a> software to find an astrometric solution, analyzes it, then searches for another solution hidden in all of the sources that don’t match the refcat using the first one. Do these two kinds of analyses always agree? Hah — of course not.</p> <p>The DASCH data model therefore counts both “exposures,” described in the historical logbooks, and “WCS solutions,” obtained from analysis of plate images. When possible, WCS solutions are matched to known exposures. But sometimes you have more exposures than solutions, and sometimes you have more solutions than exposures. If a plate has been scanned, some exposures might be mappable to the scan image (if it has an associated WCS solution), and some might not.</p> <p>While exposures lacking WCS solutions obviously don’t have high-quality astrometric information, we do have basic information about their rough sky positioning, so we’re still interested in them. The <em>daschlab</em> exposures list therefore includes all available information. This means that some exposures in the list can be mapped to a digital image, and others can’t — and the latter situation may occur even if the plate in question has been scanned.</p> <p>Correcting this situation led me to understand an important limitation of the DASCH mosaic-cutout software. 
In both <a href="https://dasch.cfa.harvard.edu/data-access/#cannon-data-portal">the “Cannon” legacy DASCH data access portal</a> and <em>daschlab</em>, the cutout-generating software doesn’t know about multiple exposures. If you request a plate cutout at a certain position, you’ll get a cutout using the first derived WCS solution (which may or may not be the first logged exposure), even if perhaps those coordinates only land on the plate using one of the other WCS solutions. Due to the way that the legacy pipeline handled multiple-exposure plates, I can’t easily fix this issue immediately. But, in the aftermath of <a href="https://newton.cx/~peter/2024/dasch-astrometry/">my reprocessing of the DASCH astrometry</a>, the associated data products are going to be significantly improved, and I’ll address this problem.</p> What TeX Gets Right Wed, 17 Jul 2024 08:22:08 -0400 https://newton.cx/~peter/2024/what-tex-gets-right/ https://newton.cx/~peter/2024/what-tex-gets-right/ <p>Last week <a href="https://newton.cx/~peter/2024/the-new-latex/">I wrote about how LaTeX might evolve to stay relevant</a> to 21st-century technical writing. But technology has come a long way since the 1970’s. Should we even be encouraging people to create documents using the venerable TeX language, which was designed at a time when computers — and computing — were so different than they are today? This week I want to write a bit about why it’s worth the effort to build on TeX/LaTeX, instead of starting fresh.</p> <span id="continue-reading"></span> <p>(This post is strongly derived from <a href="https://github.com/tectonic-typesetting/tectonopedia/blob/main/txt/explain/why-tex.tex">“Why TeX?”</a>, an explainer I’ve written as part of my prototyping of the <a href="https://github.com/tectonic-typesetting/tectonopedia">Tectonopedia</a>.)</p> <p>First things first: I'll happily admit that there are plenty of circumstances where TeX is not the best solution, and you'll be better off using some other kind of technology — whether that's <a href="https://en.wikipedia.org/wiki/Markdown">Markdown</a>, <a href="https://www.microsoft.com/en-us/microsoft-365/word">Microsoft Word</a>, pen and paper, or whatever else. The very notion of “creating documents” is so broad that it should go without saying that no single system is going to be the best choice for every situation.</p> <p>That being said, my work on Tectonic is inspired by the belief that despite its age TeX is still the very best tool in the world for solving certain kinds of problems. For instance, if you know one thing about TeX, it's probably that it's good for mathematics. And that reputation is well-earned! A proficient TeX user can easily write a single line of code to conjure up complex typeset equations.</p> <p>But what should actually impress you more than long, complex equations are equations typeset inline with body text, like <em>y = x²</em>. Readability requires that the placement, sizing, and appearance of the math and text symbols all agree well, issues that can be fudged a bit in “display” equations. TeX is one of the few tools out there that can get all these things just right. (This blog is <em>not</em> typeset using TeX, but the above equation is also quite simple; I’d say that it looks OK but not great in my browser.)</p> <p>But we wouldn't be writing all this verbiage in honor of a finicky math layout algorithm. Why can't other tools just copy TeX’s algorithms and do math equally well? 
My claim is that the real challenge of typesetting mathematics is that written math is an open-ended, generative visual language, admitting infinitely varied forms in unique, unpredictable, recursive combinations. TeX can handle math well not because it got some specific fiddly bits right — although it did — but because <em>TeX is itself an expressive, open-ended language</em>. You need an open-ended language to be able to reproduce the open-endedness of written math. Other tools give you building blocks; only TeX gives you the machinery to create and use new blocks of your own. (This is not an equivalent statement, but for what it’s worth, the TeX language is <a href="https://en.wikipedia.org/wiki/Turing_completeness">Turing-complete</a>.) And your own “blocks” can be just as easy and natural to use as the built-in ones. This is far from the only reason that TeX is good at what it does, but it's the most essential one.</p> <p>The power of TeX manifests itself not only in low-level ways, such as math typesetting, but in higher-level ones as well. For instance, scholarly documents have detailed conventions for handling bibliographic references. Although neither the core TeX language nor Microsoft Word has any built-in, structured way to represent reference metadata, TeX has been extended to support it in the <a href="https://en.wikipedia.org/wiki/BibTeX">BibTeX</a> framework. BibTeX's new commands don't feel like awkward extensions: they integrate straightforwardly with the rest of the language and are intuitive for users. While it's true that you can manually typeset your references in Word and assemble them into a bibliography, it's fair to say BibTeX provides a fundamentally more powerful way to work with them. I’ll wager that most people who have gotten the hang of BibTeX would <em>hate</em> the idea of giving it up and going back to managing their references manually.</p> <p>Now, if you’re a healthy skeptic of abstraction, you'll likely respond: “whoa, I don't want some elaborate system that can do anything — I just want a tool that helps me get the job done”. This is the right response! Human lifetimes have been wasted on the refinement of elegant but useless ideas, and we have deadlines to meet. But hopefully you'll agree that in some situations, a system of abstraction is exactly what you need to get the job done. Try doing physics without calculus.</p> <p>I'll assert, but can't possibly prove, that once you stop accepting the limitations that less powerful tools impose on you, you’ll start seeing opportunities to use TeX's capabilities everywhere. Many kinds of documents have a sort of “internal logic” that becomes easier to express given the right tools. That being said, the ones where TeX's capabilities generally add the most are ones where this internal logic is easy to find: documents with lots of cross-references, figures, tables, equations, or code — the things that I call technical documents. And it’s likely no coincidence that these are the sorts of documents that TeX has historically most often been used to create.</p> <p>It’s worth emphasizing that this “internal logic” of certain documents can be open-ended and generative in the same way that written math notation is. For instance, I often find in API documentation that certain software frameworks introduce new conceptual structures that don’t exist in the underlying implementation language. 
One example would be the notion of store <a href="https://pinia.vuejs.org/core-concepts/state.html">state</a> introduced in the <a href="https://pinia.vuejs.org/">Pinia</a> framework. You can write API documentation that discusses this state in terms of concrete JavaScript/TypeScript statements, but it’s really a new concept that deserves to be documented as a “first-class” object on equal footing with other pre-existing language concepts like classes and methods — I should be able to reference “the <code>isAdmin</code> state variable of the <code>Server</code> store” in a convenient and natural way. A documentation framework needs to give you the tools <em>to give yourself the tools</em> to do this.</p> <p>I’ll also boldly claim that despite its internal sophistication, TeX is easy to start using. I can't deny that TeX has a reputation for confusing output and sometimes inexplicable behavior, or that there are reasons that this reputation is deserved. Nevertheless, I'll point out that many mathematicians and scientists who do not care <em>at all</em> about its guts successfully use it for everyday work, even if it drives them up the wall at times. You can think of it as being a bit like <a href="https://git-scm.com/">git</a> in this way.</p> <p>Now, it’s certainly possible that one could develop a new, generative typesetting language that captures the virtues that I’ve discussed above <em>and</em> is free of TeX’s historical baggage. If you asked me to take on that task, I’d ask for … maybe a decade to do it? Designing a nicer syntax is one thing; building a whole new ecosystem is another. TeX may be old, but by the same token it is <em>battle-tested</em> and amazingly reliable — its parser can recover from the most pathological, hateful input documents that you can conceive of. While this kind of robustness often comes at a performance cost, TeX is <em>fast</em> and ingeniously efficient. It is <em>supported</em> by a worldwide community of users, who have gone to incredible lengths to modernize it and develop a dizzying array of extension packages and supporting software tools. A lot of very smart people have put a lot of effort into this language, which is still going strong after forty years — and those facts tell you something important.</p> <p>That is not to imply that today's TeX is perfect — <a href="https://newton.cx/~peter/2024/the-new-latex/">far from it</a>. The error messages are famously hard to understand. Its documentation is, ironically, a mess. Indeed, a major premise of <a href="https://tectonic-typesetting.github.io/">the Tectonic project</a> is that some aspects of the TeX ecosystem are in need of dramatic change. But not all of them. TeX <em>can be</em> the document language that the 21st century deserves.</p> LaTeX Can Be The New LaTeX Tue, 09 Jul 2024 11:29:08 -0400 https://newton.cx/~peter/2024/the-new-latex/ https://newton.cx/~peter/2024/the-new-latex/ <p>There’s a lot of interest in modernizing the tools for scientific and technical typesetting. Tools like <a href="https://mystmd.org/">MyST</a>, <a href="https://nota-lang.org/">Nota</a>, <a href="https://quarto.org/">Quarto</a>, <a href="https://idl.uw.edu/papers/living-papers">Living Papers</a>, <a href="https://show-your.work/">showyourwork!</a>, and many others are trying to make it easier — well, possible — to create technical documents that take advantage of the capabilities of today’s digital environment. Of these systems, different ones aim to work at different levels (low-level typesetting, vs. 
complete document authoring systems), but we can broadly think of them as tools aiming to become “the new LaTeX”.</p> <p>I’m not sure if I’ve made the argument in writing before, but I believe that “the new LaTeX” could, maybe even should, be founded on … LaTeX. Not LaTeX as it currently exists, but LaTeX as it could be.</p> <span id="continue-reading"></span> <p>There are two parts to this argument. The first is that TeX/LaTeX (hereafter just TeX, to emphasize the underlying language) gets some things right that are worth preserving; the second is that its problems are fixable. This post is going to skip the first part, and simply observe that here we are in the year 2024 still using TeX for precision technical typesetting, despite its age and all of its problems. It is probably the oldest piece of end-user software that’s still in common usage — TeX is nearly as old as Unix! It’s worth pondering why TeX is still relevant, but clearly it’s doing <em>something</em> right. A huge amount of human effort and ingenuity has gone into designing the TeX ecosystem; if we can build on that instead of throwing it all away, that’s a huge win for everyone.</p> <p>That being said, TeX as it currently exists absolutely has a ton of issues. It’s annoying to install. The error messages are inscrutable. The syntax is full of booby traps. It only truly targets PDF output. There are a dozen different ways to do the same thing and you need to be an expert to understand which is appropriate for your situation. It can feel impossible to dig up useful documentation on even the most basic commands. And so on, and so on.</p> <p>The ultimate causes of virtually all of these problems, I’d claim, are that TeX is simply very, very old, and that from the start the culture of TeX has been obsessed with stability. Neither of these is intrinsically a bad thing, of course! But there are also advantages to being less laden with historical baggage.</p> <p>For instance, a major issue with advancing the (La)TeX ecosystem is simply that the core TeX code has traditionally been nearly impossible to hack on. I often mention to people that when I first got interested in trying to modify the core TeX engine, it took me, an expert in this kind of task, something like several weeks to even figure out where the core source code actually lived, and how to compile it myself. That’s, like, the fundamental step involved in getting someone to contribute to your open-source project. If people can’t easily fork your project and try out changes, then no, you’re not going to streamline your installation process, or gradually evolve quality-of-life improvements.</p> <p>The other huge issue that I see is that TeX’s documentation is, to be blunt, <strong>awful</strong>. People sometimes seem to have trouble homing in on this — I mean, isn’t LaTeX literally a document preparation system? Haven’t thousands and thousands of pages of text been written about every <a href="https://ctan.org/">CTAN</a> package under the sun? And yet. Despite all of that effort, I find that existing TeX documentation generally fails completely at serving my day-to-day needs. I routinely struggle to pull up convenient, clear reference material about how a certain command works, or the design of certain fundamental concepts of the TeX language. The irony is staggering. But we can see how things got to where they are.</p> <p>First, there’s absolutely a first-mover penalty at play here. When TeX was invented, there was no World Wide Web.
Software documentation came printed on paper. So that’s how the whole ecosystem evolved: targeting print formats. The inexorable outcome of that is that these days, all of the ecosystem documentation is locked up in PDFs. They're often very <em>nice</em> PDFs, but that doesn’t do anything to help searchability, cross-linkage, and quick access. Can you think of any software system created in the past, say, decade whose documentation is <em>primarily</em> delivered in PDF format?</p> <p>Second, the TeX world’s obsession with stability led to fragmentation: if you wanted to add a new feature, you had to fork the mainline TeX engine and call it <a href="https://www.ctan.org/pkg/etex">ε-tex</a>, or <a href="https://ctan.org/pkg/ptex">ptex</a>, or <a href="https://www.ctan.org/pkg/uptex">uptex</a>, or <a href="https://www.ctan.org/pkg/pdftex">pdftex</a>, or <a href="https://www.ctan.org/pkg/xetex">xetex</a>, or <a href="https://www.ctan.org/pkg/luatex">luatex</a>, or whatever else. One of the <em>many</em> unfortunate consequences of this is that the documentation has become both fragmented and highly conditional. The information you need might be out there, but associated with a different engine; or a piece of information might be engine-specific without being labeled as such; or a really thorough piece of documentation might address a bunch of engine options but become nigh-unreadable by virtue of being mired in choices like “if you’re using XeTeX, do <em>this</em>; if you’re using LuaTeX, do <em>that</em>”. Documentation authors can’t assume that certain features are available because, hey, maybe you’re still using Knuth’s original TeX with only 256 registers.</p> <p>The longevity of TeX further complicates things. If you want to look up information about font selection, you might get material that was written before TrueType was invented. This is a good problem to have, in a certain sense, but it makes things more difficult for readers. This is especially true when combined with the ecosystem’s fragmentation and the often baked-in assumption that key elements will never change, so that there’s no need to plan for providing multiple versions of documentation.</p> <p>A particular culprit is <a href="https://search.worldcat.org/title/876762638">The TeXbook</a>. The TeXbook is, undoubtedly, an enormous accomplishment and a stunningly <em>whole</em> work, and it remains the definitive reference for the design and behavior of the innermost aspects of the TeX engine. But it’s also pretty bad documentation. If nothing else, it took me a while to appreciate that many things written in The TeXbook are <strong>simply not true</strong> of modern TeX systems, and haven’t been true for decades; now-fundamental features added in ε-TeX (1996) just don’t exist, as far as The TeXbook is concerned, let alone ones added in more modern extensions like XeTeX.</p> <p>The TeXbook has a further problem that I see mirrored in other major pieces of documentation in the TeX ecosystem. To borrow the terminology of <a href="https://newton.cx/~peter/2023/divio-documentation-system/">Diátaxis</a>, it’s trying to be two kinds of documentation at once: not just a detailed reference of the precise behavior of the system, but also a comprehensive explanation of the system design. These are both really wonderful kinds of documentation to have, but interleaving the two types of material makes the book hard to work with.
Descriptions of behavior are scattered around the book based on the overall pedagogical flow; the explanations are repeatedly derailed by precise definitions. It may feel churlish to criticize the book for being <em>too comprehensive</em> but it is a legitimate usability problem if people can’t find the information they need, when they need it!</p> <p>There’s also the basic matter of availability. It is possible to find PDFs of The TeXbook online, but they’re not supposed to exist. Reading between the figurative lines, I’m pretty sure that Knuth had a specific business plan: he’d give the software away for free, but make some money by selling the documentation. Nothing wrong with that choice at the time, but once again: notice that this is not how anyone does things anymore. When you actually have to compete for mindshare, high-quality documentation that’s freely available is one of the most valuable assets you can have.</p> <p>All of this is to say: while there may be a large quantity of documentation about the TeX ecosystem, I find that sadly it’s quite difficult to actually profit from. This is especially unfortunate since TeX is indeed a very complex system that, largely due to its age, does some things in ways that are quite foreign to modern users.</p> <p>The flip side of this is: imagine if TeX’s documentation was as sophisticated and cohesive as the underlying software. <em>It would change everything.</em> I sincerely believe that true best-in-class documentation would completely transform how people feel about using TeX. Instead of being some mysterious anachronism, full of ancient magic, it would be the quintessential power tool, sometimes challenging to master but ready to solve the hardest problems you can throw at it. It would go from dinosaur to thoroughbred.</p> <p>The <a href="https://tectonic-typesetting.github.io/">Tectonic</a> project is, of course, motivated by this vision. At the basic technical level, much of the Tectonic approach is simply about making the TeX engine hackable. As elaborated in <a href="https://doi.org/10.47397/tb/43-2/tb134williams-tectonic">the Tectonic TUGBoat article</a>, at the higher level, the project’s distinctive branding gives us the room to break with the past when appropriate. I can’t deny that launching a new brand only worsens the fragmentation issue, but I think it’s a necessary step to unstick everything else that needs improvement. Thus far the Tectonic project has not provided much at all in the way of new, high-quality documentation, but that’s something of an intentional choice — I’m dissatisfied with the currently available tools and am trying to sketch out a better system in the <a href="https://github.com/tectonic-typesetting/tectonopedia">tectonopedia</a> project. Part of that effort is building a wholly-new set of engine features and LaTeX classes to generate “native” HTML output — that is, targeting original TeX source documents aimed solely at producing optimal HTML outputs, instead of trying to generate good HTML from existing documents.</p> <p>My hope is that Tectonic can help TeX to become a sort of assembly language for complex documents. In plenty of cases, Markdown will be all you need. 
But if you want to create a sophisticated <a href="https://newton.cx/~peter/2024/digital-docs-are-web-apps/">technical-document-as-web-application</a>, it will probably be one of the final tools that runs before your web bundler, generating precision-typeset prose components and cross-referencing databases, and linking them together with all of the other elements of your document: interactive figures, navigational chrome, and the rest.</p> <p>You might ask: “why does this assembly language need to be TeX?” That gets to the question of what TeX gets right — a topic for <a href="https://newton.cx/~peter/2024/what-tex-gets-right/">another post</a>.</p> Reprocessing DASCH’s Astrometry Wed, 26 Jun 2024 09:30:00 -0400 https://newton.cx/~peter/2024/dasch-astrometry/ https://newton.cx/~peter/2024/dasch-astrometry/ <p>Yesterday I completed a large effort to reprocess all of <a href="https://dasch.cfa.harvard.edu/">DASCH’s</a> astrometry. We now have astrometric solutions for 415,848 plates, and with this groundwork laid I plan to start working on reprocessing the full DASCH photometric database.</p> <span id="continue-reading"></span> <p>There were a couple of motivations for this reprocessing effort. The DASCH astrometric calibrations have been a bit of a mess, because the astrometric solutions were stored as WCS headers attached directly to the “mosaic” FITS files. This is absolutely the most obvious way to store this information, but I’d argue that it’s actually not a great approach from a data-management standpoint. In this scheme, if you <em>re</em>-calibrate a mosaic, you need to rewrite the entire FITS file, opening up risks of data loss and confusion due to changing file contents. There’s also no clear way to, say, try a <em>new</em> calibration scheme and compare the results. You could see evidence of the inflexibility of this approach: the DASCH filesystem indicates a mosaic’s level of astrometric calibration in its filename (raw, “WW” for basic localization, “TNX” for distortion correction), and there were lots of mosaics where the filename was inconsistent with the actual FITS headers. There were also mosaics with <em>very</em> old calibrations using IRAF headers that I’d never even seen before, as well as a few seemingly lingering from very early work (e.g., some files with a “W” tag in their name).</p> <p>There were also a lot of files that looked like they <em>should</em> be easily solvable that were missing calibrations, in addition to ones with solutions that were bogus. The latter situation is documented on the DASCH website as the <a href="https://dasch.cfa.harvard.edu/drnext/ki/incorrect-astrometry/">“Incorrect Astrometry” known issue</a>, and it’s the one that bothered me the most. One of the great things about the <a href="https://astrometry.net/">Astrometry.Net</a> software, which is the basis of DASCH’s astrometric solutions, is that it should virtually never produce false positives: if it hands you a solution, that solution is pretty much guaranteed to be pretty good. So, if the DASCH solutions include bogus ones, that means that the pipeline is starting with a decent Astrometry.Net solution and breaking it.</p> <p>I dug into this issue and discovered that it seemed like the primary culprit was the stage of the DASCH astrometric pipeline that comes after Astrometry.Net.
Once we get the initial solutions, we feed them into the <a href="http://tdc-www.harvard.edu/wcstools/">WCSTools</a> program <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a>, which attempts to refine the solutions. Unfortunately, <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> uses a numerical optimizer but doesn’t have any conception of a global goodness-of-fit, which means that if the optimizer somehow converges on an incorrect solution, it’s very hard to step back and say “no, I don’t like where we ended up”. <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> is also an extremely old tool written in gnarly C, so it’s challenging to improve the code.</p> <p>I did end up creating a DASCH-specific fork of <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> and modifying a few behaviors. For instance, <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> optimizes to match the list of sources in your image against a set of catalog sources, which is constructed from a list of the brightest sources in an all-sky catalog in an RA/Dec box around your image’s initial position. This is fine most of the time, but in DASCH where some images are <em>seventy degrees tall</em>, a box in RA/Dec can be vastly bigger than the actual image area, so that downselecting to the brightest sources means that only a handful of reference sources are actually on your image. So, I added a step to limit the catalog results to ones that actually overlap the image’s initial position. (If your image is going to shift over a bit, this might mean that you miss out on a handful of sources that might be useful, but since we’re starting with an Astrometry.Net solution, we know that we’re starting very close to our destination, so the vast majority of useful sources will be covered by our initial guess of the image footprint.)</p>
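<p>To make that concrete, here’s a rough Python rendering of the catalog-downselection logic. The actual change lives in the DASCH fork of <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a>, which is C, and the function and variable names below are invented for illustration; the point is simply to project the catalog through the initial Astrometry.Net solution and keep only the sources that land on (or near) the plate <em>before</em> picking the brightest ones.</p> <pre><code>import numpy as np
from astropy.wcs import WCS

def select_reference_sources(initial_wcs: WCS, nx: int, ny: int,
                             cat_ra_deg, cat_dec_deg, cat_mag,
                             n_brightest=500, margin_pix=50.0):
    """Keep catalog sources that land on the footprint implied by the
    initial Astrometry.Net solution, then take the brightest of those."""
    # Project the catalog onto the plate using the initial solution.
    x, y = initial_wcs.all_world2pix(cat_ra_deg, cat_dec_deg, 0)

    # Accept sources within the image bounds, plus a small margin in case
    # the refined solution shifts things over a bit.
    on_image = (
        (x &gt; -margin_pix) &amp; (x &lt; nx + margin_pix) &amp;
        (y &gt; -margin_pix) &amp; (y &lt; ny + margin_pix)
    )

    # Only *then* downselect to the brightest sources, so that everything
    # we keep can actually participate in the fit.
    idx = np.flatnonzero(on_image)
    order = np.argsort(np.asarray(cat_mag)[idx])
    return idx[order][:n_brightest]
</code></pre> <p>The essential design choice is the ordering of the cuts: filter by position first, and only then by brightness.</p>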
<p>Another significant improvement wasn’t within <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> itself, but related to the list of image sources fed into it. This list was derived from a <a href="https://www.astromatic.net/software/sextractor/">SExtractor</a> table in a pretty naive way, once again just selecting the brightest sources. This scheme could fail badly for plates with really inhomogeneous backgrounds, cracks, and other defects. Once again, you would end up with a source list that only included a handful of actual stars from the image, which is a great way to get the optimizer to drive your solution somewhere unfortunate.</p> <p>There were a lot of other small fixes as well, trying to improve the robustness of the pipeline. In the end, the success rate went from 94% to 97% — or, put another way, I was able to eliminate fully half of the solution failures. What I haven’t yet been able to look into is the number of incorrect solutions coming out of the new codebase. I think that it should be a lot lower thanks to the kinds of improvements I mentioned above, but it’s surprisingly hard to check into this quantitatively. For any given plate, it’s easy to identify a way-off solution by eye, but automating such an analysis to cover the full diversity of the DASCH collection — some plates contain 50 stars, some contain 200,000 — is a lot harder.</p> <p>There are also borderline cases, often relating to plates with multiple exposures. A plate with multiple exposures doesn’t have “an” astrometric solution — it has a set of several solutions. Any given location on the plate has several different, valid RA/Dec coordinates, and a single RA/Dec position may appear at multiple pixel positions on the plate! As you might expect, this can get pretty gnarly to deal with. The DASCH pipeline works by starting with an initial list of all of the sources extracted from a plate image, and then peeling away the sources that match the catalog after the best astrometric calibration is arrived at. The remaining unmatched sources are then fed into Astrometry.Net again, and you iterate until Astrometry.Net stops finding solutions. (So, this process depends highly on Astrometry.Net’s lack of false positives!)</p> <p>These plates can fail in a few hard-to-handle ways, though. Particularly tricky are the ones containing close multiple exposures, like <a href="https://starglass.cfa.harvard.edu/plate/a21562">A21562</a>. If you have two exposures very close to one another, the <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> optimizer is vulnerable to what I call the “split-the-difference” effect, where it converges to a solution that lands right between each source pair. In a least-squares sense this is indeed the optimal solution, but it’s incorrect. You can also get an effect where if, say, the source pairs are oriented in the left-right direction, the solution has a global skew where it’s bang-on for the left sources in the top-left of the plate, but bang-on for the <em>right</em> sources in the bottom-right of the plate. This one is even trickier to detect since you can have really high-quality matches across large areas of the plate — and since many plates have large-area defects, you can’t expect a plate to have high-quality solutions <em>everywhere</em>. All of these problems get even hairier for plates with <em>many</em> exposures, like <a href="https://starglass.cfa.harvard.edu/plate/c21253">C21253</a>, which appears to contain 11 exposures in a tight spatial sequence.</p> <p>As best I can see, these plates will need to be handled by some preprocessing. You could make a histogram of pixel separations for every pair of sources in an image, and then decompose the peaks to infer how many close-multiple exposures there are. (Not all multiple exposures are close: some plates have, say, one exposure on the celestial equator, and one at the pole.) For instance, if a plate has four exposures, you might find up to six peaks in the source-pair separation histogram: one for each pair of exposures. (But you might not find that many if peaks coincide; e.g., for a sequence of exposures with a roughly regular spacing on the plate.) You could potentially increase the power of the search by adding a delta-instrumental-magnitude axis to the histogramming process, since if you have a pair of exposures where one has half the duration of the other, the repeated source images should all appear fainter by a constant factor, modulo the nonlinearity of the photographic medium. All of this analysis could be done before any astrometric or photometric calibration, and then you could use it to filter down the source lists that you feed into Astrometry.Net and imwcs, hopefully preventing issues like split-the-difference. I’ve looked into all of this in a preliminary way, but unfortunately I don’t know if I’ll get the chance to really pursue this idea.</p>
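<p>Just to make that first step a bit more concrete, here is a minimal Python sketch of the pairwise-separation histogram, using SciPy and hypothetical inputs: pixel coordinates and rough instrumental magnitudes from the source extraction. To be clear, this is not DASCH pipeline code, just an illustration of the idea.</p> <pre><code>import numpy as np
from scipy.spatial import cKDTree

def separation_histogram(x, y, inst_mag, max_sep_pix=200.0, bin_pix=2.0):
    """Histogram the separations of all source pairs closer than max_sep_pix.

    Peaks suggest close multiple exposures: each pair of exposures piles up
    source pairs at (roughly) one characteristic separation on the plate.
    """
    xy = np.column_stack([np.asarray(x, float), np.asarray(y, float)])
    mag = np.asarray(inst_mag, float)

    # Find every pair of sources closer than the search radius. Keep the
    # radius modest for crowded plates, since the number of pairs grows fast.
    pairs = np.array(sorted(cKDTree(xy).query_pairs(max_sep_pix)))
    if len(pairs) == 0:
        return None, None

    dxy = xy[pairs[:, 0]] - xy[pairs[:, 1]]
    sep = np.hypot(dxy[:, 0], dxy[:, 1])
    dmag = np.abs(mag[pairs[:, 0]] - mag[pairs[:, 1]])

    # 1D histogram of separations ...
    sep_bins = np.arange(0.0, max_sep_pix + bin_pix, bin_pix)
    hist1d, _ = np.histogram(sep, bins=sep_bins)

    # ... and a 2D version with a delta-magnitude axis, which should sharpen
    # the peaks when one exposure is systematically shorter than another.
    dmag_bins = np.arange(0.0, 5.0, 0.25)
    hist2d, _, _ = np.histogram2d(sep, dmag, bins=[sep_bins, dmag_bins])
    return hist1d, hist2d
</code></pre> <p>Decomposing the peaks of those histograms into a set of exposure offsets would then feed the source-list filtering described above.</p> <p>Anyway. Due to issues like the above, there might be more astrometric reprocessing in my future, but hopefully we’ve improved the baseline significantly.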
The updated astrometric database includes 415,848 solutions, and everything is based on <a href="https://www.cosmos.esa.int/web/gaia/dr2">Gaia DR2</a> via the <a href="https://archive.stsci.edu/hlsp/atlas-refcat2">ATLAS-REFCAT2</a> catalog. This project also got me to really straighten out my understanding of how to handle multiple-exposure plates, as well as some of the technical infrastructure surrounding them. They represent only a fraction of the corpus, but I think it’s important to deal with them properly.</p> <p>The processing took a span of around 46 days all told, starting with a list of 428,416 mosaics. Over the course of that time I did a lot of work to speed the running of the pipeline, so that if I had to restart everything now, I think that I should be able to finish in 17 days or fewer. Fun fact: I only made one improvement to optimize the actual DASCH pipeline code, to fix some catastrophic failures where a certain step would sometimes take ~forever. Literally everything else was about making more optimal use of the <a href="https://www.rc.fas.harvard.edu/about/cluster-architecture/">Cannon</a> cluster resources, which was enough to speed up my processing by a factor of <em>five</em> or more. It pays to understand your platform, folks! The biggest win there was making a change to avoid the <a href="https://slurm.schedmd.com/documentation.html">Slurm</a> scheduler as much as possible and use my own job-dispatching system instead. It’s a little disappointing, in a certain sense, to be able to beat the scheduler at its own game so badly, but on the other hand Slurm is definitely optimized for highly-parallel simulations, not lots of little interdependent data-processing jobs.</p> <p>As a final point, these new astrometric solutions are <em>not</em> yet available in <a href="https://dasch.cfa.harvard.edu/data-access/">the DASCH data access services</a>. As I mentioned at the outset, the legacy DASCH data organization just serves up whatever WCS headers are attached to each FITS image. The reprocessed approach stores the astrometric data separately, as small data packages that I’m calling “results”. Eventually, these results will be made available as their own data products, and the various data access services will combine the mosaic imagery and the astrometric results on-the-fly as appropriate. But, I need to write and deploy the code to do all of that. In the meantime, if you’re interested in the new solutions, get in touch.</p> <p>Now that I have this new baseline of astrometry results, though, I can turn to the next step: reprocessing all of the photometry. Like the DASCH astrometric data, the DASCH photometric databases contain 18+ years of accumulated cruft, and I’m very eager to clean them out! I’ll also use the “result” framework to do a better job of exposing the photometric calibration data, which I expect to be very valuable for people who want to dig into the DASCH lightcurves in detail. 
In particular, I should finally be able to surface the information that indicates which plates use which emulsion, which is a super important piece of information that we can’t actually pull out right now.</p> The Software Citation Station (Wagg & Broekgaarden 2024) Tue, 11 Jun 2024 11:13:33 -0400 https://newton.cx/~peter/2024/citation-station/ https://newton.cx/~peter/2024/citation-station/ <p>A new software citation project is hot off the presses — <a href="https://www.tomwagg.com/software-citation-station/">The Software Citation Station</a>, as described in Wagg &amp; Broekgaarden 2024 (<a href="https://arxiv.org/abs/2406.04405">arXiv:2406.04405</a>). I was starting to build toward a discussion of this topic in <a href="https://newton.cx/~peter/2024/indexing-discoverability/">my previous post</a> so this is great timing, as far as I’m concerned!</p> <span id="continue-reading"></span> <p>In <a href="https://newton.cx/~peter/2024/indexing-discoverability/">Software Indexing and Discoverability</a> I wrote about software indices (i.e., Big Lists of All The Software; a.k.a. registries, catalogs), which are strangely common in the research world, if you consider how few people in the general open-source community are interested in building such things. One oft-cited reason for indexing software is “discoverability”, a goal which I argued is actually (and perhaps surprisingly) very much in tension with the general desire for indices to be as complete as possible.</p> <p>The other, probably dominant, motivation for indexing software is to enhance citeability. Everyone knows that software work doesn’t get enough credit in academia; not only do researchers not cite the software that they <em>should</em> be citing, but even when they want to appropriately credit a certain piece of software, it’s often not clear <em>how</em> to do so. Of course, the latter trend tends to induce the former.</p> <p>One popular approach to fix this situation is to try to index software and ask people to <em>cite through the index</em>. For instance, this is explicitly what the <a href="https://ascl.net/">Astrophysics Source Code Library</a> (ASCL) is all about. Once a piece of software is indexed in the ASCL (e.g., <a href="https://github.com/pkgw/pwkit/">pwkit</a> becomes <a href="https://ascl.net/1704.001">ascl:1704.001</a>), the index entry is potentially citeable; for instance, ASCL has an agreement with <a href="https://ui.adsabs.harvard.edu/">ADS</a> such that <a href="https://ascl.net/1704.001">ascl:1704.001</a> becomes <a href="https://ui.adsabs.harvard.edu/abs/2017ascl.soft04001W">2017ascl.soft04001W</a> with associated <a href="https://ui.adsabs.harvard.edu/abs/2017ascl.soft04001W/exportcitation">BibTeX</a> you can use in your next paper. If <em>everyone</em> puts their software in the same index, then you have a nice uniform way to cite software.</p> <p>My blunt assessment, however, is that this approach is <strong>fundamentally broken</strong>. The fundamental issue is that there is only one entity that gets to determine how to cite some scholarly product: its publisher. When I publish an article in <a href="https://apj.aas.org/">ApJ</a>, it is <a href="https://journals.aas.org/">AAS Publishing</a> that ultimately determines how that article should be cited: “People can cite your article by referencing: ApJ, volume 123, e-id 456, DOI: 10.yadda/yadda, authors Williams, ...”.
A significant responsibility of the AAS Publishing organization is to ensure that such citations will remain useful into the indefinite future.</p> <p>The problem with cite-through-the-index is that the index is not the publisher. The index may record information <em>about</em> various entities but is not ultimately in control of those entities. This means that the index is going to get out of date with regard to the actual citeable objects in question, both in terms of keeping up with newly-published entities and changes to existing ones. This may sound like a fixable problem, but I have become more and more convinced that it is foundational, and dooms the whole enterprise. The problem is that maintaining a high-quality index of stuff that <em>other</em> people publish is enormously labor-intensive and therefore costly. If you’re an index, you need to provide something really valuable in order to be long-term sustainable. Meanwhile, publishers — people who make things citeable — are essentially fungible; think of how many different scholarly journals there are! So there will always be bargain-basement publishers; and in the field of software, publishing generally costs zero. The value provided by the publishing service is not enough to offset the costs of maintaining an index worth using.</p> <p>By contrast, consider ADS. It is also an index of published objects, but while astronomers will certainly trade around bibcodes like <a href="https://ui.adsabs.harvard.edu/abs/2022ApJ...935...16E">2022ApJ...935...16E</a> informally, when it comes time to make a formal citation, they “resolve” that bibcode to the <em>actual</em> citation, <a href="https://doi.org/10.3847/1538-4357/ac7ce8">Eftekhari et al., 2022 ApJ 935 16</a>. ADS indexes citeable items but is not construed as making them citeable itself.</p> <p>That may be so, but it doesn’t make it any cheaper to maintain ADS’s index. ADS is clearly doing something that the community finds incredibly valuable. If ADS isn’t making things citeable, what is it doing?</p> <p>In my view, the key is that ADS provides uniform search of the citation network <em>across publishers, in a regime where references between items from different publishers are both interesting and ubiquitous</em>. Articles cite each other across publisher boundaries willy-nilly, which makes it very valuable to be able to have a unified search interface that crosses publisher boundaries as well. If I don’t care about that cross-referencing network, or if items within one publisher only reference other items from the same publisher, the value of an ADS-like service is a lot less clear. (This is why it’s not clear to me that “ADS for datasets” is a viable concept: the network of links between datasets is, I think, a lot less interesting.) In general, a multi-publisher index is going to provide some kind of homogeneous view of its collection, and there has to be something about that homogenization that people find valuable <em>in and of itself</em>.</p> <p>This analysis also helps us see why software citation is in such sorry shape to begin with. <strong>Software is hard to cite because it is self-published.</strong> I don’t send my latest Python package to AAS for them to review and archive it; I just upload it to GitHub myself. This is great in a lot of ways, but it turns out that high-quality publishing is harder than it looks! In particular, archiving and preservation, the bedrock of citeability, are specialized tasks that really ought to be done by professionals.
Returning to the analogy with traditional articles: when I publish through ApJ, I’m not told to manually upload my article to Zenodo or to figure out what the appropriate reference should be; AAS does the necessary tasks on my behalf and then tells me the result. <em>This is how it should be</em>. Amateur-hour attempts to do these things are what get you, well, the current state of software citation. Instead of a few professionals doing things in a consistent way, you get a bunch of well-meaning people trying to figure things out one at a time.</p> <p>One path to improving the state of software citation, then, is to make it easier for people to do a decent job of publishing software. Tools like <a href="https://zenodo.org/">Zenodo</a> and my <a href="https://pkgw.github.io/cranko/book/latest/">Cranko</a> aim to help on this end, and it is easy to see how the value provided by ASCL was much greater before GitHub and its ilk emerged.</p> <p>But the other half is that we have an enormous corpus of self-published stuff out there that deserves citation, and we need to make it easier for people to do so. This finally brings us back to the <a href="https://www.tomwagg.com/software-citation-station/">Software Citation Station</a>. At first blush this might seem similar to something like <a href="https://ascl.net/">ASCL</a>, but it’s different in a very important way. SCS is not attempting to provide its own citations; instead, it is like ADS, indexing entities that were published elsewhere. This is, in my view, the correct approach.</p> <p>The SCS also, I believe, captures a key insight about the role of indices in the field of software citation. If we’re trying to maintain an index of software packages that are published in a variety of external locations, the key question we have to ask is: why is anyone actually going to use this thing? What are we providing that a Google search doesn’t?</p> <p>There’s a beautiful answer: standardized information about how things should be cited! The whole key is that there will <em>never</em> be a one-size-fits-all way to cite any piece of software that’s ever been posted online. This is why people write <a href="https://libguides.mit.edu/c.php?g=551454&amp;p=3900280">whole guidebooks</a> about <a href="https://www.software.ac.uk/publication/how-cite-and-describe-software">how to cite</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7805487/">software</a>, while no one sees any need to document “how to cite an article published in a major journal”. Because software is self-published and each package is a unique snowflake, citation-wise, citation instructions are unclear and are scattered across the internet in READMEs, Zenodo pages, published articles, and elsewhere. This is exactly the kind of inhomogeneous chaos that a good index can simplify. The actual citations can’t be homogenized, but the information about <em>how to cite</em> can be. Probably the <em>only</em> commonality about all the different pieces of research software out in the world is that people want to know how to cite them!</p> <p>There’s one piece, however, that I think the SCS is missing. Maybe Tom and Floor are about to get a bunch of funding, but I suspect that the maintenance of this index will be challenging.
As I alluded to in <a href="https://newton.cx/~peter/2024/indexing-discoverability/">my kicker last week</a>, I can think of basically one “form factor” that is proven to at least be <em>able</em> to yield a reliable online knowledgebase without paid staff: a wiki.</p> <p>While the SCS has a form for people to submit new software, I expect that framing will discourage involvement. If SCS is missing an entry for, say, <a href="https://mosfit.readthedocs.io/">MOSFiT</a> (and it is), I probably won’t feel comfortable “submitting” a record for it unless I’m the primary author (and I’m not). But if we cast it as a wiki, then it opens the door up for me to do my best to create a record for the software, even if it’s not “mine”. Of course, maybe I’ll make a mistake and someone will have to come in and correct it, but that’s exactly how people expect to use wikis. If I’m an active maintainer of a package, I’ll want to come in and check that the SCS record is correct; but if a package becomes unmaintained, it is wholly appropriate, and perhaps necessary, for other people to keep the citation information up-to-date. For instance, maybe a project was published using some service that got shut down and absorbed into Zenodo; you don’t need to be the original author to assess whether the citation information should change to refer to the new service. While the desires of active maintainers should always take precedence, of course, there are plenty of cases where third parties are perfectly competent to maintain the citation information.</p> <p>So really what I think we need is a “Cite-o-pedia”: one page per citeable package, creatable or editable by anyone in the world. You’d certainly want some structure under the hood to record common elements like associated articles, Zenodo concept DOIs, and so on, but fundamentally the citation information might have to be free text, because of the wild diversity of practices that authors want people to follow.</p> <p>And with this perspective, nothing about this idea is specific to software. For instance, you could imagine a Cite-o-pedia entry for <a href="https://www.pas.rochester.edu/~emamajek/doc.html">Erik Mamajek’s star notes</a>; I know for a fact that people want to cite them even if Erik himself isn’t very concerned with that! The Cite-o-pedia concept is only specific to <em>self-published scholarly entities</em>, simply because things that come out of traditional publishers are homogeneous enough that it’s “obvious” how to cite them. </p> <p>That being said, one thing that a Cite-o-pedia could do that would dovetail well with a software focus would be helping people deal with versioning. If I’m citing a piece of software, I should ideally refer to two items: the overall software package, and the exact version that I’m using; the difference is formalized by Zenodo in the distinction between <a href="https://pkgw.github.io/cranko/book/latest/integrations/zenodo.html#orientation-software-dois">concept and version DOIs</a>. But most people are a little sloppy about this stuff. You could imagine Cite-o-pedia pages having decision trees that walk you through how to figure out which version you’re using and what, therefore, to cite.
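</p> <p>To make the “structure under the hood” idea concrete, here is a purely hypothetical Python sketch of what a Cite-o-pedia record might hold. Nothing like this exists today, and every field name below is invented; the point is just that structured identifiers, per-version entries, and free-text citation instructions would all live side by side.</p> <pre><code># Purely hypothetical sketch; these field names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class VersionEntry:
    version: str               # e.g. "2.1.0"
    version_doi: str = ""      # a Zenodo-style version DOI, if one exists
    notes: str = ""            # free text: "use this if you ran the 2021 pipeline"

@dataclass
class CiteRecord:
    name: str                         # human-readable package name
    homepage: str = ""
    concept_doi: str = ""             # the "project as a whole" identifier
    associated_articles: list = field(default_factory=list)  # bibcodes, DOIs, ...
    versions: list = field(default_factory=list)             # VersionEntry items
    how_to_cite: str = ""             # free-text instructions; the load-bearing field
</code></pre> <p>A wiki page per package could render a record like this however it liked; the free-text field is what absorbs the “unique snowflake” diversity, while the structured fields are what would support those version decision trees.</p> <p>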
Versioning is also not specific to software: you could imagine an entry for <a href="https://www.sdss.org/">SDSS</a> that helps you figure out which data release you’re using and the appropriate citation; see also AAS’s <a href="https://doi.org/10.3847/1538-4357/aae58a">living articles</a>.</p> <p>A couple more minor comments on the SCS article:</p> <ul> <li>The article states that if you cite a piece of software, you should cite everything in its dependency graph as well. It is <em>far</em> from obvious to me that this is true: if nothing else, as a matter of practicality, the dependency graph for a large application might include hundreds of packages. As an analogy, if I’m citing an article, I don’t cite every article that <em>it</em> cites! Choosing to cite something is ultimately just that — a choice, made by the author.</li> <li>The SCS is currently hosted on <a href="https://www.tomwagg.com/">tomwagg.com</a>; it will probably have to move to a non-personal domain to really start feeling like a community-wide resource. Fortunately, domains are pretty cheap these days.</li> </ul> <p>It’s exciting to see that people are thinking along these same lines, and the Software Citation Station feels very close to what I think the community needs. I’ve convinced myself that the “wiki” framing will be super important for success, but I’ll be very curious to see if that bears out in practice.</p> Software Indexing and Discoverability Wed, 05 Jun 2024 10:30:00 -0400 https://newton.cx/~peter/2024/indexing-discoverability/ https://newton.cx/~peter/2024/indexing-discoverability/ <p>Scientists have a thing about indexing their software. That is, a lot of them seem to think that it’s important for someone to maintain a big list of All of The Software for them to look at. (You could call this an “index,” or a “registry,” “catalog,” etc.; as far as I’m concerned these terms are all interchangeable.) This impulse is interesting to me because it’s basically absent in, say, the broader open-source software community. Why is that?</p> <span id="continue-reading"></span> <p>To pick a few concrete examples, in astronomy, we have the <a href="http://ascl.net/">Astrophysics Source Code Library</a> (ASCL); for exoplanets there’s NASA’s <a href="https://emac.gsfc.nasa.gov/">EMAC</a>; in heliophysics there’s the <a href="https://heliopython.org/projects/">PyHC Projects list</a>. NASA has a general <a href="https://software.nasa.gov/">software catalog</a>. There’s a generic <a href="https://research-software-directory.org/software">Research Software Directory</a>. It was even quite easy for me to find at least one <a href="https://github.com/NLeSC/awesome-research-software-registries">“Awesome Research Software Registries” list</a>, which is essentially an index of scientific software indices.</p> <p>Meanwhile, if I told someone in the open-source world that I wanted to make a list of all of the open-source software, I’m quite confident that I’d get two chief reactions: “why” and “also, that is obviously impossible”. To the extent that one can make any generalizations about large and diverse communities, the mindset is just very different. The prototypical scientist, if they think about this at all, is attracted to the possible upsides of having a good index without necessarily dwelling on the work needed to construct and maintain one.
The prototypical developer does the opposite.</p> <p>To cut to the chase, I’m generally on what you might call the “developer” side of things: there are a lot of attempts to index software that, frankly, I don’t feel are very valuable. But I don’t feel that way about all of them! To me, this signals that it’s important to try to think carefully about <em>when</em> and <em>why</em> these kinds of indices can be useful and successful. It’s an undertheorized topic, from what I’ve seen.</p> <p>A commonly-cited motivation for creating one of these indices is to enhance software “discoverability”. For example, the <a href="https://nspires.nasaprs.com/external/solicitations/summary.do?solId=%7b21419978-190B-811F-35A7-6D2DEEE24E4E%7d&amp;path=&amp;method=init">NASA HPOSS solicitation</a> specifically calls out “advancing access, curation, and discoverability of research software” as a priority of the NASA Open-Source Science Initiative. The rest of this post is going to focus on just this one aspect, but of course there are a variety of others at play — I’ll probably write more about them in the future.</p> <p>The discoverability angle has always puzzled me. They know that we have Google, right? Presumably there’s a feeling that general-purpose search doesn’t meet the needs of researchers, but I haven’t seen any arguments to that effect explicitly articulated.</p> <p>Signal-to-noise in search results is a possible factor, but I find that software packages are generally quite “Google-able”. It’s also true that domain-specific indices can include faceted search features not available in general-purpose engines, but I’m not at all persuaded that this is actually that important in this area. In my experience, faceted search is only really useful when you have <em>lots</em> of competing products within a well-established parameter space (shoes, hard drives). Meanwhile, research software is so bespoke that there’s hardly ever more than a handful of options that truly meet a particular need. In fact, research software is so bespoke that you’re pretty much always going to need to read the README (or equivalent) of any package you might want to use, anyway — no matter how detailed your domain-specific metadata are, your software index isn’t going to contain everything I need to make a decision. In other words, domain-specific indices are neither necessary nor sufficient for research software discovery. Google can index the README, and I’m going to need to read the README anyway.</p> <p>Furthermore, we already have a well-established domain-specific software discovery engine: the literature. If I’m doing work in a particular field and the papers keep on mentioning <a href="https://en.wikipedia.org/wiki/GADGET">GADGET</a>, well, now I know about it. And a general-purpose Google for <a href="https://www.google.com/search?hl=en&amp;q=gadget%20astrophysics">gadget astrophysics</a> yields pages upon pages of relevant resources.</p> <p>All that being said, people do discover software through means other than general-purpose text search. The Awesome Research Software Registries list mentioned above is an instance of a popular trend of making “<a href="https://github.com/topics/awesome-list">awesome lists</a>”, which generally present a big, er, list of resources in some topic that are judged to be, um, awesome. 
Many of them relate to software and technology, such as a list of <a href="https://github.com/awesome-selfhosted/awesome-selfhosted">software for self-hosting network services</a>, but the pattern now ranges from areas like <a href="https://github.com/DopplerHQ/awesome-interview-questions">interview questions</a> to <a href="https://github.com/sindresorhus/awesome-scifi">sci-fi novels</a>.</p> <p>These might at first seem quite similar to research software indices, but there’s an essential difference. It’s right there in the name: awesome lists are curated and selective, rather than exhaustive. This is even emphasized in the very first sentences of <a href="https://github.com/sindresorhus/awesome/blob/main/awesome.md">the awesome manifesto</a>: “If you want your list to be included on [the awesome list of awesome lists], try to only include actual awesome stuff in your list. After all, it's a curation, not a collection.” The model of a “registry”, on the other hand, generally abdicates the curatorial role: the whole idea is that random people can come along and slot their item into the collection wherever they deem appropriate. There isn’t a hard demarcation here — a hurried awesome list curator might accept submissions with minimal review; any competently-run registry is going to have some level of centralized editorial oversight — but the philosophical difference is important.</p> <p>(I see this issue with <a href="https://arxiv.org/">arxiv.org</a> periodically. Many people think of it as a “registry” type service and expect their submissions to flow through the system automatically; but under the hood, it is a moderated and to some extent curated system. People often do not react well when the influence of the moderators becomes apparent.)</p> <p>There’s a potentially interesting relationship between modes of discovery and expertise to be explored here. When you’re well-versed in a topic, general-purpose search supports discovery quite well: you know what terms to use, and you can quickly delve into results to identify which ones truly meet your needs. As a person with a lot of software experience, this is surely a big reason that research software indices don’t feel that useful to me.</p> <p>When you’re less expert, however, effective discovery requires more guidance. Curation — deference to the expertise of others — becomes more important. You’re more likely to spend time browsing a list of options, rather than narrowing it down as quickly as possible. From this standpoint, it’s easy to see why awesome lists are popular; I think it’s fair to say that they’re aimed at people who aren’t already experts in their fields. “Discoverability” turns out to be a relative term, depending on who exactly is trying to do the discovering.</p> <p>The problem with casting research software indices as discovery tools is that nearly all of the other motivations for their existence — e.g., enabling or promoting software citation — require them to be exhaustive, not curated. This puts them in a bind. If you’re not going to give people an opinionated list of recommendations, you need to be better than Google. It’s not hard to beat Google in terms of signal-to-noise of your results, but you also need near-perfect completeness — anything less feels like a catastrophic failure to the user.</p> <p>Unfortunately, achieving near-perfect completeness is, in general, awfully expensive. But it’s possible.
One route to this kind of completeness is to provide a service so valuable that software creators are <em>compelled</em> to use it. This happens with language package systems like <a href="https://www.npmjs.com/">NPM</a> or <a href="https://crates.io/">Crates.io</a>, etc., where the value of integration with the system is indisputable. And, indeed, when I need to search for JavaScript packages I’ll do it on <a href="https://www.npmjs.com/">NPM</a> rather than Google. I don’t use any fancy faceting, but the completeness is just as good, and the signal-to-noise is better.</p> <p>The other route is to continuously put in a ton of curatorial effort, like <a href="https://ui.adsabs.harvard.edu/">ADS</a>. This is also awfully expensive, and a non-starter for all but the best-funded of projects. Unless … you can convince people to donate their effort. We can call this the Wikipedia model.</p> Envisioning the Thingiverse Tue, 21 May 2024 10:54:19 -0400 https://newton.cx/~peter/2024/thingiverse/ https://newton.cx/~peter/2024/thingiverse/ <p>A <a href="https://www.aaronland.info/weblog/2024/04/26/matrix/#usf">recent essay and talk</a> by <a href="https://www.aaronstraupcope.com/">Aaron Straup Cope</a> touches on a <em>lot</em> of different interesting ideas, but one that particularly struck me was his experiment with putting things — in his case, a museum collection — on the Fediverse.</p> <span id="continue-reading"></span> <p>Cope currently works at the <a href="https://www.sfomuseum.org/">SFO Museum</a>, as in, the museum embedded in the San Francisco International Airport, and seems to have a history straddling the cultural heritage and technology sectors. This sounds both super cool and, probably, often very frustrating — my impression is that museums and similar institutions have technology challenges similar to, but even bigger than, those found in scientific education and public outreach. In fields like these, if you have a visionary bent, you can imagine <em>amazing</em> new things that are possible with current technologies; and then trying to implement the most modest project is slow, grinding, often disillusioning work. Even when people buy into a particular vision, the resources needed to execute it well are often far beyond what can be marshaled — a situation I know well from projects like <a href="https://worldwidetelescope.org/home">WorldWide Telescope</a> (WWT) and the efforts to use it in education like <a href="https://www.cosmicds.cfa.harvard.edu/">Cosmic Data Stories</a>. And there are, frankly, often people in positions of power who don’t seem to have a great deal of vision beyond the status quo.</p> <p><a href="https://www.aaronland.info/weblog/2024/04/26/matrix/#usf">Cope’s essay</a> is a very nice textualization of a talk recently delivered at <a href="https://myusf.usfca.edu/arts-sciences/art-architecture/art-history-museum-studies">USF</a>. It’s on the longer side, but well worth the read. (Side note: if you’re going to go to the effort of carefully preparing a talk, it seems well worth it to go the extra mile to write it up in this form — you’ll already have done the hard work of planning the argument and preparing visuals, and the result can propagate so much farther. Even if the talk is recorded, I find “too video; didn’t watch” to be a very real thing.) </p> <p>Not being a museum person, I don’t have anything substantial to say about several of his points, except that they ring true to me.
My colleague David Weigel of the <a href="https://www.rocketcenter.com/INTUITIVEPlanetarium">INTUITIVE Planetarium</a> at the <a href="https://www.rocketcenter.com/">US Space &amp; Rocket Center</a>, like Cope, spends a lot of effort on trying out ways that his experience can “follow you out of the building” — something that WWT is great for! — but I’m continually shocked-but-not-surprised at how few institutions seem to be tackling this challenge. I suspect that many individuals working at these institutions would love to do more, but just don’t have the resources they need to get anything off the ground. Sadly, this lack of resources seems to often turn into a sort of learned helplessness, rather than spurring creativity.</p> <p>Cope is really into the idea of museum visitors building up a durable, personal relationship with the items in collections. Over the years, he’s been involved in a few attempts to use technology to encourage this — I gather that the <a href="https://www.aaronland.info/weblog/2015/04/10/things/#mw2015">Cooper Hewitt Pen</a> is the best-known of these, although it’s not something that I’d heard of before. His most recent iteration is an idea that I love: creating a Fediverse/Mastodon account for every item in the SFO museum.</p> <p>One of the motivations is as follows: right now, if someone in a museum sees something that they want to remember, the most likely thing they'll do is take a picture of it on their phone. There’s nothing wrong with that, but I can imagine that as a museum curator it might feel like a hugely missed opportunity. It’s a one-time interaction, and structured information about the object in question is basically lost — in the sense that, say, if the object has a unique identifying number, there’s no reliable way for software to obtain it. Without that, even the most basic next steps that you can think of — e.g., “list all of the things that I liked at that museum” — aren’t possible.</p> <p>So: what if instead, visitors could “like and subscribe” to objects?</p> <p>Even if the object in question never posts anything, the “follow” action records the connection in a way that’s durable, bidirectional, and machine-actionable in the future. And you can immediately start thinking about things you can build based on the resulting social graph. (Which, of course, implies that privacy has to be considered carefully when building a system in practice.)</p> <p>Cope correctly points out that there are some significant practical hurdles to getting this foundational interaction to work. A small one is that there’s no easy way to convert a QR code scan into a follow, as far as I know; the much, much bigger one is that so few people are on the Fediverse. But I am extremely sympathetic to the argument that this kind of interaction <em>should</em> be on the Fediverse, and not a proprietary network, and there’s something about this vision that seems to me to be <em>so</em> sensible, in a certain way, that it actually makes me more optimistic about the Fediverse taking off in general. While I don’t see an ecosystem like the Fediverse ever having the social, viral grip of a commercial product like TikTok, I could see it gaining traction for what you might call “anti-viral” communication patterns: municipal updates about garbage collection, that kind of thing. 
And there’s at least the possibility that commercial experiences will bridge to the Fediverse, à la Threads, although such bridging has become a whole, contentious can of worms.</p> <p>To help motivate the focus on the Fediverse, Cope mentions that he asked someone at FourSquare if he could create 50,000 venues in bulk — one for each item in MoMA — and was, unsurprisingly, turned down. Likewise for creating 200,000 Twitter accounts. Beyond a desire to avoid vendor lock-in, which is a strong motivation in its own right, the ability to mass-create accounts is something that seems well-suited for a decentralized network. We see this with email, too. More generally, you often arrive at interesting places if you start with a service that has “user accounts” that were intended to be operated by a single human, and ask yourself: why might one person want to have 100 accounts? Why might one account want to be associated with 100 people? Or zero? You see the same patterns pop up with things like the <a href="https://www.spamresource.com/2023/08/private-relay-vs-private-relay-vs-hide.html">Apple Private Relay email service</a> or <a href="https://www.paypal.com/us/money-hub/article/virtual-credit-card-numbers">virtual credit card numbers</a>.</p> <p>Beyond the connection-building functionality — social following is the new bookmarking, you heard it here first — you then have the fact that your objects can toot. (Which, if you’re not familiar, is what we’re supposed to call the equivalent of tweeting on Mastodon. I guess “tweet” sounded pretty silly at first, too.) Depending on your default level of optimism, this is probably either very exciting or very scary. Surely you can spark joy and build amazing connections with the right kinds of interactions. On the other hand — we have enough trouble answering support questions in person, and now you want us to monitor 50,000 inboxes? Content moderation? It’s hard for me to see how opening up all of these touch points doesn’t become a huge new source of work. You can definitely make the argument that it’s good work: if you’re a museum, what better way can you be spending your time than interacting with patrons in a sustained way? But it’s work all the same.</p> <p>I’m not at all sure whether this particular idea will take off, but I love the audacity, and I like the range of new ideas that can build on it. What if every single <a href="https://platestacks.cfa.harvard.edu/">Harvard plate</a> had a Fediverse account? Why would people decide to “follow” a given plate? What would they say to it? What could we have the plates “say” spontaneously? If people start having conversations with plates, does that collected history start becoming a sort of per-item knowledgebase? I’m particularly intrigued by this last possibility. If you can somehow get people into a habit of messaging objects — @-ing is the new annotation? — it seems like a whole new mechanism for achieving the goals of projects like <a href="https://underlay.mit.edu/">The Underlay</a>.</p> <p>What might ensue if I created a Fediverse account for every post on this blog? What if every Wikipedia page was on the Fediverse, and posted about its updates? Every star in <a href="https://simbad.u-strasbg.fr/simbad/">SIMBAD</a>? Every <a href="https://somerville.mysticdrains.org/">drain in Somerville</a>? Maybe — probably — a lot of these ideas just wouldn’t work, but it would be fun to find out.</p>