rubbl_casatables 0.8 Wed, 14 Aug 2024 11:50:46 -0400 https://newton.cx/~peter/2024/rubbl-casatables-0-8/ https://newton.cx/~peter/2024/rubbl-casatables-0-8/ <p>Yesterday I put out the <a href="https://github.com/pkgw/rubbl/releases/tag/rubbl_casatables%400.8.0">first release in the 0.8.x series</a> of the <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a> <a href="https://rust-lang.org/">Rust</a> crate, which provides access to the “casatable” data container file format used by the <a href="https://casa.nrao.edu/">CASA</a> radio interferometry package (sometimes called the <a href="https://casacore.github.io/casacore-notes/255.html">CASA Table Data System</a>). CASA uses the casatable format for virtually all of its data files, most notably the <a href="https://casadocs.readthedocs.io/en/stable/notebooks/casa-fundamentals.html#MeasurementSet-Basics">MeasurementSets</a> that store interferometric visibilities. This release is primarily the work of <a href="https://github.com/d3v-null/">@d3v-null</a> at <a href="https://astronomy.curtin.edu.au/">Curtin</a>, who undertook the difficult and tedious project of updating the backing codebase to use the <a href="https://github.com/casacore/casacore/releases/tag/v3.5.0">casacore 3.5.0</a> implementation of the casatables format.</p> <span id="continue-reading"></span> <p>To understand the niche occupied by <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a>, it’s helpful to be careful about how we discuss CASA’s data files. In particular, the distinction between a MeasurementSet and a casatable is important.</p> <p>When I discuss the casatable format, I’m referring to what I call a <a href="https://en.wikipedia.org/wiki/Container_format">container format</a> or sometimes “serialization format”. It defines <em>how</em> to store various kinds of data into files on disk, but is silent on the <em>why</em>: what the data actually mean. File archive formats like <a href="https://en.wikipedia.org/wiki/ZIP_(file_format)">Zip</a> are perhaps the most obvious examples of container formats: the Zip specification tells you how to pack various files into a Zip archive, but it doesn’t (and can’t, and shouldn’t) make any claims about the meaning of those files. Other file formats, like Java’s <a href="https://en.wikipedia.org/wiki/JAR_(file_format)">JAR</a> files, actually use the Zip container format for their underlying storage, then apply additional layers of semantics. For instance, a JAR file should contain a Zip entry called <code>META-INF/MANIFEST.MF</code> whose contents and meaning are defined by the JAR specification.</p> <p><a href="https://casadocs.readthedocs.io/en/stable/notebooks/casa-fundamentals.html#MeasurementSet-v2">The MeasurementSet specification</a>, to be contrasted with casatables, is closer to the JAR format in that it works at a more semantic level, defining a way to represent interferometry data in particular. 
This representation can then be captured in a casatable, but it can also potentially be mapped to other serialization formats like <a href="https://arrow.apache.org/">Arrow</a>, <a href="https://zarr.dev/">Zarr</a>, or <a href="https://parquet.apache.org/">Parquet</a> (see, e.g., the <a href="https://github.com/ratt-ru/arcae/">arcae</a> Python package).</p> <p>In astronomy, we often talk about <a href="https://en.wikipedia.org/wiki/FITS">FITS</a> as an image format, and indeed the “I” in “FITS” does stand for “image”, but modern FITS is really a container format as well. After all, FITS files can contain multiple <a href="https://docs.astropy.org/en/stable/io/fits/api/hdus.html">Header Data Units</a>, each of which can contain an N-dimensional array, or a binary table, or <a href="https://fits.gsfc.nasa.gov/registry/fitsidi.html">interferometry data</a>, or <a href="https://fits.gsfc.nasa.gov/fits_registry.html">all sorts of other custom data structures</a>. I would claim that the longevity of FITS stems in large part from its evolution from a pure image format to a more flexible container format with a very simple specification, which makes it easy to implement FITS readers and writers in a variety of languages. For instance, I also wrote a small <a href="https://docs.rs/rubbl_fits/"><code>rubbl_fits</code></a> crate that exposes FITS I/O to Rust. It’s beta-quality at best, but I’m still pretty confident that it can correctly understand the structure of any valid FITS file that you throw at it.</p> <p>The casatables format is another member of the astronomical data container family. In my view it’s most similar to <a href="https://www.hdfgroup.org/">HDF5</a>: they’re both relatively “modern” formats for complex scientific data, designed as extensible container formats from the start.</p> <p>The other thing that these two formats have in common — which honestly infuriates me in both cases — is that their byte-level serializations are extremely complex and at best weakly documented; effectively the <em>only</em> way to use these container formats is through extremely gnarly C++ libraries provided by the format developers. HDF5 does at least have a <a href="https://docs.hdfgroup.org/hdf5/v1_14/_f_m_t3.html">written specification of the on-disk format</a>, while to the best of my knowledge there isn’t one at all for casatables. But in either case, if you wanted to implement a parser for the format from scratch in another programming language, I strongly suspect that you’d basically have to reverse-engineer the C++ codebase.</p> <p>If you ask me, the cardinal virtue of a container format is to have a minimally-complex, clear specification that you can imagine implementing from scratch if needed. Once again, I think this is why FITS has lasted for so long. It’s not that I’m worried about literally losing the ability to compile the relevant implementation libraries, although C++ code in particular seems to need regular maintenance just to stay buildable: the language has an unfortunate combination of high complexity and constant evolution as people try to patch up its many flaws. It’s a more general sense of unease with the design of such formats. One thing I’ll point to is that both casatables and HDF5 have acquired pluggable I/O backends, which essentially formalize the requirement that the only way to reliably decode datasets is through the official C++ libraries. 
Casatables has extensible “storage managers” like <a href="https://github.com/aroffringa/dysco">Dysco</a>, while HDF5 has things like <a href="https://docs.hdfgroup.org/hdf5/v1_12/_v_f_l.html">virtual file layers</a> and <a href="https://docs.hdfgroup.org/hdf5/v1_14/_h5_v_l__u_g.html">virtual object layer connectors</a>.</p> <p>Compounding the complexity in CASA is that the casatables container format implementation is embedded within the rest of the CASA C++ library ecosystem, which makes things even more baroque. Even the “streamlined” <a href="https://github.com/casacore/casacore/">casacore</a> codebase consists of, by my quick estimate, around 2,300 source files, with dependencies on a number of external libraries like <a href="https://www.atnf.csiro.au/people/mcalabre/WCS/index.html">wcslib</a>, <a href="https://www.fftw.org/">fftw3</a>, <a href="https://www.hdfgroup.org/">HDF5</a>, and <a href="https://invisible-island.net/ncurses/">ncurses</a> (!). If you just want to understand what’s in a casatable file tree, you need to build this whole suite of libraries, although to be fair you only need to link with a subset of them. But still, this is ridiculously onerous for what should be a low-level operation: <em>understand the contents of this container</em>.</p> <p>If you use CASA, your analysis sessions are creating casatables data all over the place: MeasurementSets, calibration files, images, source lists, and probably other kinds of data as well. With the standard CASA software, if you want to interact with these files, you need to use these unwieldy C++ libraries — either directly, or using wrappers that rely on them, like the <a href="https://github.com/casacore/python-casacore">casacore Python bindings</a>. I can say from experience that doing so is pretty unpleasant, which in turn limits the development of a software ecosystem around these kinds of data. I’m absolutely certain this is why we see the various efforts to express MeasurementSet data to other serialization formats like <a href="https://arrow.apache.org/">Arrow</a> that I mentioned above.</p> <p>This is, finally, where <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a> comes in. This <a href="https://crates.io/">Rust crate</a> provides support for using the casatables container format in a self-contained library that provides a hopefully-clean API.</p> <p>It accomplishes this through brute force: it bundles the subset of the <a href="https://github.com/casacore/casacore/">casacore</a> C++ code needed to work with casatables data, and nothing more. There are a small number of modifications to make the codebase more “standalone”, but in the end only a few are needed. Compared to stock casacore, there are “only” 783 C++ source files to compile, and no external dependencies.</p> <p>The design of the Rust packaging ecosystem plays a major role here. While Rust crates are basically reusable libraries in the C/C++ tradition, you don’t compile them into shared libraries and install them into <code>/usr/lib64</code>. Rust, like <a href="https://go.dev/">Go</a>, has a static-first model. These languages strongly encourage you to only produce binary executables, not shared libraries. If your executable depends on another package, you compile it directly into the executable, rather than installing it as a separate shared library that your executable then depends on. 
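</p>
<p>To make that concrete: pulling <code>rubbl_casatables</code> into a program is just another Cargo dependency, with the bundled casacore C++ compiled straight into your executable, and basic read access looks something like the sketch below. (I’m writing the method names from memory, so treat this as illustrative rather than definitive and check the <a href="https://docs.rs/rubbl_casatables/latest/rubbl_casatables/">API documentation</a> for the real signatures.)</p>
<pre><code>// Sketch only: names recalled from memory; consult docs.rs before relying
// on exact signatures.
use rubbl_casatables::{Table, TableOpenMode};

fn main() {
    // A casatable "file" is really a directory tree on disk.
    let mut table = Table::open("data.ms", TableOpenMode::Read)
        .expect("failed to open the table");

    println!("number of rows: {}", table.n_rows());

    // TIME is a standard float64 column in a MeasurementSet's main table.
    let times: Vec&lt;f64&gt; = table
        .get_col_as_vec("TIME")
        .expect("failed to read the TIME column");

    println!("first timestamp: {:?}", times.first());
}
</code></pre>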
<p>This static-first model makes for some relatively large binaries (you're including all of the code that would otherwise be separated out into a shared library) and long compile times (you're compiling all of that code from scratch, as well), but in my experience it’s absolutely the right paradigm. We’ve come a long way from <a href="https://en.wikipedia.org/wiki/DLL_Hell">DLL Hell</a>, but every shared library dependency still adds a number of failure modes to a software deployment.</p> <p>My main use of <a href="https://github.com/pkgw/rubbl"><code>rubbl_casatables</code></a> is in a companion project called <a href="https://github.com/pkgw/rubbl-rxpackage"><code>rubbl-rxpackage</code></a>, which implements a few low-level data-processing tasks that are not available in mainline CASA. The most interesting one is <a href="https://github.com/pkgw/rubbl-rxpackage/blob/master/src/peel.rs">the key data transformation</a> needed to support <a href="https://newton.cx/~peter/2024/peeling-tool/">the peeling workflow that I’ve developed</a>. Another good example is a utility called <a href="https://github.com/pkgw/rubbl-rxpackage/blob/master/src/spwglue.rs"><code>spwglue</code></a> that merges together adjacent spectral windows in a MeasurementSet. My first implementation of <code>spwglue</code> was in Python using the casacore Python bindings; porting to Rust sped it up by a factor of <strong>twenty</strong>.</p> <p>More broadly, these kinds of programs <em>could</em> be implemented in C++, but doing so is much, much less pleasant. Besides the heaviness of the casacore library dependency, I cannot emphasize enough how much I dislike working in C++ (see <a href="https://mastodon.world/@vitaut@mastodon.social/112957324812448747">this timely toot</a>). C++ was a step forward in its time, but “its time” was thirty years ago. We’ve learned <em>so much</em> about designing programming languages since then, both in terms of ergonomics and safety. Obviously there are immense amounts of legacy code and expertise that we can’t and shouldn’t just throw away, but there are huge benefits when you can move to better tools. Rust makes me actually enjoy writing systems-level code, while (and in no small part because) it simultaneously gives me much more confidence in that code’s correctness.</p> <p>As I mentioned at the top, the <a href="https://github.com/pkgw/rubbl/releases/tag/rubbl_casatables%400.8.0">new release of <code>rubbl_casatables</code></a> primarily upgrades the backing C++ code from version 3.1.1 of casacore to version 3.5.0. To the best of my knowledge, this shouldn’t affect much in the way of major functionality, but it does include a bunch of that necessary C++ maintenance that I mentioned above. <a href="https://github.com/d3v-null/">@d3v-null</a> undertook this project to get the codebase building on additional CPU targets, and put in a lot of effort to untangle the casacore changes necessary to perform the update — see issue <a href="https://github.com/pkgw/rubbl/issues/345">#345</a> for all of the gory details. It took me a very long time to follow up on all their work, which is something I hope to do better about going forward.</p> <p>The only really tricky thing I did was <a href="https://github.com/pkgw/rubbl/pull/393">address a longstanding weakness</a> dealing with some uses of uninitialized memory in the Rust side of the casatables implementation. We had some code that would allocate uninitialized array buffers and then pass them off to the casacore C++ code to fill with data. 
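</p>
<p>To give a flavor of the problem, here is a simplified sketch; it is not the actual <code>rubbl</code> code, and <code>cxx_fill</code> stands in for the real casacore binding:</p>
<pre><code>use std::mem::MaybeUninit;

// `cxx_fill` is a stand-in for the real casacore binding: it promises to
// write all `n` elements through the pointer it is given.
fn read_column(n: usize, cxx_fill: unsafe extern "C" fn(*mut f64, usize)) -> Vec&lt;f64&gt; {
    // The tempting C-style version is formally undefined behavior in Rust,
    // even though it often appears to work:
    //
    //     let mut buf = Vec::with_capacity(n);
    //     unsafe { buf.set_len(n) };        // asserts initialized data that isn't there
    //     unsafe { cxx_fill(buf.as_mut_ptr(), n) };
    //
    // The careful version routes through MaybeUninit and only reinterprets
    // the memory as plain f64s after the C++ side has actually written it:
    let mut buf: Vec&lt;MaybeUninit&lt;f64&gt;&gt; = Vec::with_capacity(n);

    unsafe {
        buf.set_len(n); // fine: a MaybeUninit&lt;f64&gt; is allowed to be uninitialized
        cxx_fill(buf.as_mut_ptr() as *mut f64, n);

        // Reassemble the now-initialized buffer as a Vec of plain f64s.
        let ptr = buf.as_mut_ptr() as *mut f64;
        let (len, cap) = (buf.len(), buf.capacity());
        std::mem::forget(buf);
        Vec::from_raw_parts(ptr, len, cap)
    }
}
</code></pre>
<p>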
This is the kind of thing that, coming from a C/C++ background, is second nature, but it turns out that you actually have to do this with a lot more care than you might think. <a href="https://github.com/rust-lang/rfcs/blob/master/text/2930-read-buf.md#background">Rust RFC 2930</a> has a good discussion of the issues at play; see also <a href="https://www.ralfj.de/blog/2019/07/14/uninit.html">this blog by Ralf Jung</a>. This is the kind of thing that makes me really thankful that people involved in Rust are super fastidious about getting right, while there’s so much legacy C++ out in the world that you can, at best, only hope that a nontrivial project does the right thing consistently.</p> <p>You might wonder about this “Rubbl” name. Rubbl is an umbrella collection of my foundational Rust projects relating to astrophysics (“Rust + Hubble”), mostly data formats: casatables, FITS, and <a href="https://github.com/astroumd/miriad">MIRIAD</a>. I hope that it at least has the possibility of one day growing into something truly foundational like <a href="https://www.astropy.org/">Astropy</a>, but right now the casatables implementation is basically the only thing that gets regular usage. My plan is that if I find myself needing to write data analysis code that would be in C or C++, I’ll try to do it in Rust and extend Rubbl as needed. So far, that need has only come up rarely, although the <a href="https://github.com/pkgw/rubbl-rxpackage"><code>rubbl-rxpackage</code></a> tools are great examples of times that it has. It’s been very pleasing that the existing code has at least been useful enough for people like <a href="https://github.com/d3v-null/">@d3v-null</a> to get involved.</p> <p>The DOI of this new release is <a href="https://doi.org/10.5281/zenodo.13315460">10.5281/zenodo.13315460</a> (automatically registered in Zenodo with <a href="https://pkgw.github.io/cranko/">Cranko</a>). The API docs for <code>rubbl_casatables</code> may be found <a href="https://docs.rs/rubbl_casatables/latest/rubbl_casatables/">on docs.rs</a>.</p> Addendum: Seven-Figure Scientific Software Projects Thu, 08 Aug 2024 12:45:41 -0400 https://newton.cx/~peter/2024/seven-figure-addendum/ https://newton.cx/~peter/2024/seven-figure-addendum/ <p>There’s at least one important point that I forgot to make in <a href="https://newton.cx/~peter/2024/seven-figure-software/">yesterday's post</a>.</p> <span id="continue-reading"></span> <p>I wrote that while software projects are in many ways like hardware projects, one noteworthy difference is that they’re basically made <em>entirely</em> out of people. While an established hardware development program is supported by durable physical assets like lab equipment, an established software development program is, to a very good approximation, completely intangible. This makes it extra-important to provide stable, sustainable support for the people involved. Of course, this is not to imply that such stability isn’t important for hardware programs too, or that people who build hardware are interchangeable.</p> <p>What I forgot to point out is that this is also the case for <em>core astronomical research itself!</em> Most established research programs in exoplanets, gamma-ray bursts, or whatever, are likewise made entirely out of people. 
I’m quite sure that folks with more equipment-intensive programs, like lab astrophysicists, would agree that what they’ve built is mostly about human expertise, rather than the hardware.</p> <p>And universities have a time-honored mechanism for recruiting and retaining the experts around whom these research programs are centered. <strong>You hire them and you give them tenure.</strong> It’s a decently successful approach.</p> <p>This is really the thing that’s been bothering me. It is indeed very difficult to recruit and retain the skilled professionals needed to build high-quality software in a university context. But we already have the systems in place to do this! We just need to collectively decide that software leads deserve tenure-track slots.</p> <p>Now it remains true that building astronomical software, like building hardware, is not quite the same thing as doing astronomical research itself. The act of writing software can (ideally) create tools that lead to new astronomical knowledge; and, something that’s pretty underrated, it can help capture and make explicit existing or previously-vague astronomical knowledge. But it doesn’t create new astronomical knowledge <em>per se</em>. Furthermore, I’m particularly interested in infrastructural software, which you could view as being an extra step removed from knowledge creation: tools that help you create more tools.</p> <p>And my impression — maybe it’s wrong — is that at the moment, to get hired onto the tenure track as a tool-builder of either the software or hardware variety, you need to have a story about how you’re going to bring in a lot of money to support your activities. A plan to operate in artisan craftsmanship mode, just you and an apprentice in the basement, won’t get you hired. Hence my focus on grantmaking in <a href="https://newton.cx/~peter/2024/seven-figure-software/">the previous post</a>.</p> <p>The other analogy to mention is that the commonplace uncertainty and unpredictability of the software design process is <em>also</em> very reminiscent of the uncertainty and unpredictability of scientific research itself. You try stuff; you realize that your first plan wasn’t working; you pivot; something surprising catches on with your colleagues. So in this way too, software development should be a very comfortable fit for academia. Innovative hardware development has the same features too, but for understandable reasons the <em>speed</em> at which things in software can evolve feels a lot closer to the speed at which the science itself can move.</p> <p>All of this makes me optimistic that there’s a way to find a home for innovative scientific software development — and developers — in the university system. But it would certainly be a lot easier to do so if we had more well-established avenues for PIs to get funding for substantial software projects.</p> Seven-Figure Scientific Software Projects Wed, 07 Aug 2024 13:20:00 -0400 https://newton.cx/~peter/2024/seven-figure-software/ https://newton.cx/~peter/2024/seven-figure-software/ <p>“Get this — I just got a six-million dollar grant to develop a new astronomical image viewer!” That’s not something that’s ever been said, I’ll wager. But why not?</p> <span id="continue-reading"></span> <p>I started thinking about this in the context of a slogan that I’ve been toying with for a little while: “Every astronomy department should have a tenured software specialist”. Bold (and self-serving), I know. 
But it’s almost defensible, I believe, if we think optimistically based on the analogy to hardware instrumentation development. In both cases we’re talking about people who build tools. We have well-known problems giving these people enough credit, but I’d like to think that astronomers generally appreciate that our field moves forward on the basis of their work. Having a tool-builder in-house gives your faculty a leg up on the competition. Developing new tools tends to be expensive, and to require specialized skills … but by the same token, good tool-builders should be able to bring in a lot of overhead!</p> <p>And this is true of hardware development. Compared to your baseline generic <a href="https://new.nsf.gov/funding/opportunities/astronomy-astrophysics-research-grants-aag">NSF AAG</a> research grant of around $500k, hardware projects can access bigger pots of money. To pick a few awardees from the older <a href="https://new.nsf.gov/funding/opportunities/mid-scale-innovations-program-astronomical">NSF MSIP</a> program, you can get $2.5 million to <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2216481">build an exoplanet imaging spectrograph for Keck</a> (UC Irvine), or around $7 million for <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1836002">an integral field spectrograph for Magellan</a> (MIT). You can get a lot more, of course, as you scale up from individual instruments to whole facilities.</p> <p>If I’m applying for tenure-track positions as a person who builds software (I’m not — that ship has sailed), I want to be able to tell a story about how I’ll secure grants in that seven-figure-and-beyond range. Even if we ignore the very real fact that people do care how much overhead you bring in, this is simply the scale of funding that you need if you want to start something important that has the chance to make a lasting difference in the field. (Something like <a href="https://emcee.readthedocs.io/">emcee</a> might be an exception, but I also bet that emcee would have a lot more impact if a few million dollars were spent on it!) For reference, typical seed funding rounds for Silicon Valley startups <a href="https://carta.com/learn/startups/fundraising/seed-funding/">look like they’re $3 million</a>, and Series A rounds are larger by a factor of a few.</p> <p>Compared to the hardware domain, though, it’s a <em>lot</em> harder to tell that story. Your intuition probably screams out that you’d have zero chance of getting the NSF to hand you seven-figure sums on the basis of “I have an idea for the next <a href="https://astropy.org/">Astropy</a>” or even much more specific, but still ambitious projects like “I want to build a new VLBI data reduction package”. 
<a href="https://new.nsf.gov/funding/opportunities/advanced-technologies-instrumentation-astronomical">NSF ATI</a> (grants going up to ~$2M, total pool this year of ~$8M) nominally supports software development but the framing of the program (“enable observations for ground-based astronomy that are difficult or impossible to obtain with existing means”) makes that a virtual non-starter, and I don't see any pure-software projects in <a href="https://www.nsf.gov/awardsearch/advancedSearchResult?ProgEleCode=121800&amp;BooleanElement=Any&amp;BooleanRef=Any&amp;ActiveAwards=true#results">the recently awarded ATI projects</a>.</p> <p>Now, there is <a href="https://new.nsf.gov/funding/opportunities/cyberinfrastructure-sustained-scientific">CSSI</a> out of the NSF <a href="https://new.nsf.gov/cise">CISE</a> directorate: “Cyberinfrastructure for Sustained Scientific Innovation”. This is probably the closest in spirit to the kind of funding that I’d like to see, and the program scale is in the right ballpark. CSSI “Framework Implementation” awards come in around $2 million. But there are planned to be around ten of these given out in the 2024 round, across pretty much the whole NSF; framework implementations are “aimed at solving common research problems faced by NSF researchers in one or more areas of science and engineering”. This is all well and good, but think of the hardware analogy: would that Keck imaging spectrograph fit that definition? <a href="https://worldwidetelescope.org/">WorldWide Telescope</a> got <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=2004840">a smaller CSSI Elements grant</a>, and I would love to go for a Framework Implementation, but it would be a difficult sell.</p> <p>In the current environment, if you want access to substantial resources for software development, you can tune your CSSI pitch, and you can try to piggyback on tangible facilities: maybe you can secure a big subaward to develop something like a pipeline for a major observatory. That’s simply where the significant pots of money for astronomical software development can actually be found — attached to very large projects like <a href="https://www.lsst.org/">Rubin</a> and space missions.</p> <p>These projects are only going to support certain kinds of software development, though. Not to undersell the importance of pipelines and other facility-type software, but when I think of software efforts that ambitious “software instrumentalists” would want to be able to point to as significant professional accomplishments, I think of things like <a href="https://astropy.org/">Astropy</a>, <a href="https://jupyter.org/">Jupyter</a>, <a href="https://docs.mesastar.org/">MESA</a>, or <a href="https://sites.google.com/cfa.harvard.edu/saoimageds9">ds9</a>, the project behind of this year’s <a href="https://www.adass.org/softwareprize.html">ADASS Software Prize</a>. These are also the kind of project that we need much more of, I think. Historically, people have found ways to support work on these sorts of foundational systems through facilities funding (ds9 probably being the best example), but as funding gets tighter, software gets more expensive, and people appreciate more and more just how difficult software projects are, this approach seems less and less viable to me.</p> <p>There’s a much bigger problem here than simply the lack of an appropriately-targeted funding program, though. 
As almost everyone has come to recognize by now, most software projects are <strong>fundamentally</strong> different undertakings than hardware projects, in ways that have significant implications for how they need to be supported and managed. This is despite the fact that in other ways, software and hardware projects indeed have much in common.</p> <p>Consider some of the MSIP examples above. <em>Some</em> of the key aspects of the deliverables are extremely concrete: I will build a spectrometer with such-and-such resolving power, operating in such-and-such waveband, attaching to the back of such-and-such telescope. It’s possible to specify software deliverables in the same way: Astropy will allow users to load FITS files; <a href="https://casa.nrao.edu/">CASA</a> will allow users to calibrate VLA data. You <em>can</em> build software this way, and sometimes you have to; but even the most straitlaced engineering organizations now understand that software-by-specification is at best a deeply limited approach. Say what you will about <a href="https://agilemanifesto.org/">agile</a>, <a href="https://www.scrum.org/">scrum</a>, and the rest, but these methods were invented because traditional ones were utterly failing in the software context.</p> <p>Many thousands of words have probably been written about “why software is different.” To a certain extent, the specific reasons probably aren’t even that important. But as someone who cares a lot about the quality of software, in the gestalt <a href="https://en.wikipedia.org/wiki/Pirsig%27s_Metaphysics_of_Quality">Zen-and-the-Art-of-Motorcycle-Maintenance sense</a>, I can say that I find that the things that make certain pieces of software the <em>most</em> exciting and inspiring are the ones that are farthest from what would be captured in a typical specification. <a href="https://git-scm.com/">Git</a> versus <a href="https://subversion.apache.org/">Subversion</a>, <a href="https://ninja-build.org/">Ninja</a> versus <a href="https://en.wikipedia.org/wiki/Make_(software)">make</a>, <a href="https://beancount.github.io/">Beancount</a> versus <a href="https://hledger.org/">hledger</a>, <a href="https://rust-lang.org/">Rust</a> versus <a href="https://en.wikipedia.org/wiki/C%2B%2B">C++</a>: each pair of tools would likely satisfy the same written spec, but you’ll never convince me that they’re of equal quality.</p> <p>Anyway, all that is to say that in my view, the reason that the NSF doesn’t have a great way to give you $5 million to build the next Astropy is that everyone involved recognizes that doing so would rarely yield good results within the current framework. You could have very little confidence up-front about what was going to come out of the whole effort, and it would be really easy to spend all that money and end up with something that no one actually wanted to use. The early-2000s <a href="https://ui.adsabs.harvard.edu/abs/2001ASPC..238....3S/abstract">US NVO</a> experience isn’t exactly inconsistent with all this. I’ve been harping on the NSF here for specificity, but any traditional grantmaker is going to face the same issues.</p> <p>It’s true that projects that have <em>already</em> achieved a high level of significance can attract big grants: Jupyter landed <a href="https://blog.jupyter.org/new-funding-for-jupyter-12009a836867">$6 million in 2015</a>; Astropy broke through with <a href="https://www.moore.org/grant-detail?grantId=GBMF8435">$900k from Moore in 2019</a>. 
But unfortunately, it’s really, really hard to build up a compelling software project on a series of small grants. My understanding is that <a href="https://www.stsci.edu/">STScI</a> made a long-term investment on the order of tens of millions to get Astropy going, and ds9 has benefited from long-term, steady funding via <a href="https://chandra.harvard.edu/">Chandra</a> — funding that’s now in extreme danger thanks to <a href="https://cxc.cfa.harvard.edu/cdo/announcement.html">Chandra’s budget being blown up</a>. <a href="https://www.plasmapy.org/">PlasmaPy</a> did <a href="https://www.nsf.gov/awardsearch/showAward?AWD_ID=1931388">get $1.4 million</a> relatively early in the project history, but they likely benefited from having an extremely legible pitch: “let’s make an Astropy for plasma physics”.</p> <p>I’m sure that hardware development has comparable bootstrapping problems, but it seems to me that the challenges for software are going to be worse. If you’re starting out a new hardware development program, you might convince the NSF or your institution to invest in lab space, a vector network analyzer, a mass spectrometer, or whatever. If it all goes belly-up, you still have your capital investment. Software projects, on the other hand, are all <a href="https://www.investopedia.com/ask/answers/112814/whats-difference-between-capital-expenditures-capex-and-operational-expenditures-opex.asp">opex</a> to a good approximation — people. If a project fails, you’ll have essentially <em>nothing</em> to show for it. What’s worse, talented people care about things like “whether they will have a job in a year“ or “what their long-term career prospects look like,” whereas vector network analyzers emphatically don’t. I believe strongly that if you want to recruit and retain good software developers, you’ll have to be able to offer them a level of stability and career growth potential that is <em>extremely</em> foreign to university standards. And you’re not going to get good software without good developers.</p> <p>So, how do we make it possible for someone to establish themselves as a “software instrumentalist”? It goes without saying that more funding wouldn’t hurt, but the key point is that if we want to enable significant, innovative, PI-driven scientific software projects — and I think we do — we need different <em>kinds</em> of funding. The software projects that I think, frankly, are the most interesting and valuable entail a kind of uncertainty that does not match well to traditional grant-proposal models, and the challenge is made only more difficult because building a sustainable software-production practice requires stable, substantial investment.</p> <p>Undoubtedly people have ideas about ways that grantmakers could do things differently to better support innovative software development. The obvious source of inspiration would be Silicon Valley: call it the “startup” model. I think the key through-line would be that the funder would have to think of itself as investing in a team of people, rather than a particular product. Startups pivot all the time, after all. Maybe your initial product idea wasn’t any good, but if you can show that you’ve become skilled at figuring out how to build something interesting that people will actually use, that’s a success. 
I’m not finding anything to link to at the moment, but I’m sure this sort of idea is well-trodden ground.</p> <p>What’s interesting is that I see elements of this approach in the design of the <a href="https://science.nasa.gov/learn/about-science-activation/">NASA Science Activation</a> program, NASA’s umbrella funding vehicle for science education projects. Grants are relatively large and long-lasting; oversight is relatively hands-on, with regular meetings and each project having to retain an external evaluator; and there’s a big emphasis on inter-project collaboration and the development of an overall education-focused community of practice. If I had a really big pot of money to support innovative PI-driven software projects, those would all be things that I’d want to have as well.</p> <p>You could also say that the upshot of all of this is that if you want to produce innovative scientific software, stay out of the universities. Go get a job at <a href="https://www.stsci.edu/">STScI</a> and convince a higher-up to peel off some money to support your vision. It’s not terrible advice, but I’d really like to think that we can do better. I think there are a ton of PI-driven software projects that could be executed for an amount of money that’s totally in line with hardware development efforts, and would deliver comparable if not much more impact for the expense — think Astropy and <a href="https://ui.adsabs.harvard.edu/">ADS</a>. The benefits might be extremely diffuse, but that’s exactly the kind of thing that grantmakers are supposed to figure out how to support.</p> <p>I don’t have a way to make money magically appear, but if it does, the key is to be able to spend with confidence. That means having better tools to estimate cost and schedule for specific software projects; a clear idea of how we’re going to do oversight; realistic models for developer retention, software adoption, and other social processes; and ultra-clear definitions of success. If we understand and even embrace the distinctive characteristics of software development, and think carefully about how those characteristics interact with our existing institutions, we can tap into an immense amount of potential.</p> <p><em>See also <a href="https://newton.cx/~peter/2024/seven-figure-addendum/">an addendum</a>.</em></p> Mitigating “Source Splitting” in DASCH Wed, 31 Jul 2024 08:35:22 -0400 https://newton.cx/~peter/2024/dasch-source-splitting/ https://newton.cx/~peter/2024/dasch-source-splitting/ <p>Last week I added a new feature to <a href="https://dasch.cfa.harvard.edu/">DASCH</a>’s new data analysis toolkit, <a href="https://dasch.cfa.harvard.edu/drnext/"><em>daschlab</em></a>. There is now code that aims to help mitigate the <a href="https://dasch.cfa.harvard.edu/drnext/ki/source-splitting/">“source splitting” known issue</a>, in which photometric measurements of a single source end up divided among several DASCH lightcurves. Supporting this new feature are some API changes in <em>daschlab</em> that will affect most existing analysis notebooks, a new tutorial notebook, and associated documentation updates.</p> <span id="continue-reading"></span> <p>“Source splitting” is the name that I’ve given to a phenomenon that occurs in the DASCH data with some regularity. Sometimes, if you pull up the lightcurve for a decently bright star, there will only be a handful of detections, when there should instead be thousands. 
If you dig deeper, what you’ll usually find is that the detections of your source have been erroneously associated with other nearby stars in the reference catalog. The detections are all there, but they’re mislabeled as to which star they belong to.</p> <p>To understand how this happens, it helps to think about how DASCH lightcurves are constructed. The pipeline processing of each plate image is totally standard: extract a source catalog from the image, then derive astrometric and photometric calibrations. Once this is done, the DASCH pipeline “pre-compiles” lightcurve photometry: the image catalog is matched to a reference catalog (“refcat”; <a href="https://www.aavso.org/apass">APASS</a> or <a href="https://archive.stsci.edu/hlsp/atlas-refcat2">ATLAS-REFCAT2</a>), and any matched detections are appended to a giant database of photometry for every refcat source. This scheme is basically how every time-domain lightcurve database works.</p> <p>(At the moment, the photometric calibration and per-image source catalog are then basically thrown away. This means that DASCH cannot support forced photometry at specific locations, among other useful operations. Hopefully we’ll be able to add this capability in the future.)</p> <p>One of the issues with DASCH, though, is that our astrometry is quite uncertain. A major factor here is the fundamental fact that the underlying data are analog, so we have to construct a digital astrometric solution. But also, DASCH plates often came from tiny telescopes, which means that they both cover huge areas of the sky and often have significant distortions away from the optical axis. The pipeline can do an extremely good job of astrometric calibration, but in a collection of more than 400,000 images, there are going to be errors.</p> <p>Personally, I’ve found it hard to get used to the fact that if you have a stack of DASCH cutouts that are nominally centered around some RA/Dec, the coordinates for some images might have <em>significant</em> systematic offsets. If I have a bunch of detections of a source, I’m used to, say, plotting the detection RA/Decs and flagging ones with large offsets as outliers. But in DASCH, those detections might be totally fine — it could be the WCS solution that’s wrong, not the source measurement.</p> <p>Making all of this worse is that the DASCH collection is so heterogeneous. Some plates in the collection have spatial resolutions 25 times better than others. For the highest-resolution plates, you might have a region where in the best cases you can accurately assign photometric measurements to any of, say, a dozen stars. For the lowest-resolution plates, all of those stars might blend together, and an astrometric solution that’s overall quite good might still cause one star to be misidentified as another. You can see how this would lead to source splitting.</p> <p>Finally, there’s a contributing factor that I have to confess that I don’t fully understand. The DASCH pipeline had a lot of infrastructure to search for transients, like flaring X-ray binaries. As best I can tell, the pipeline had some mechanism to identify promising transients and add them to the refcat, presumably with the goal of identifying additional outbursts if there should be any. But this functionality seems to have interacted poorly with the astrometric issues above. 
I believe, but am not entirely sure, that in some cases the pipeline would end up creating new refcat entries for “transients” that were really well-known sources identified with the wrong location due to astrometric calibration errors. All of this is clouded in uncertainty because I’ve chosen to completely ignore the issue of DASCH transients for the time being, so I haven’t spent any time learning about this aspect of the historical pipeline. There’s a ton of scientific potential in this area, but I think that the underlying calibrations and data products need to be improved before transient searches will really become worthwhile.</p> <p>The good news is that when source splitting happens, it’s generally a well-behaved phenomenon. You’ll typically find that the reference catalog contains a number of entries near your target of interest (within, say, 20 or 30 arcsec), and that detections of that target are in effect randomly assigned to one of those entries. So, for any given exposure containing your source, exactly one of those refcat entries has a good detection, and the rest are upper limits. “All” you need to do is merge the detections together and ignore the upper limits.</p> <p>The <a href="https://dasch.cfa.harvard.edu/drnext/"><em>daschlab</em></a> software now supports this, in the function <a href="https://daschlab.readthedocs.io/en/latest/api/daschlab.lightcurves.merge.html"><code>daschlab.lightcurves.merge()</code></a>. Given a set of input lightcurves, it will do precisely what I just wrote, with one minor elaboration. First, the algorithm matches up all of the lightcurves by exposure and groups together entries that are totally unambiguous: cases where there is exactly one detection among all of the lightcurves. It will then use those data to determine a mean source magnitude. Next, it will make a pass over the ambiguous cases: ones where, at a given exposure, multiple lightcurves contain detections. For those, it will choose the lightcurve point whose magnitude is closest to the mean. This is a pretty dumb heuristic, but in my samples there are only a handful of ambiguous points (fewer than ten, in lightcurves with thousands of detections), so in most cases it should be good enough.</p> <p>To demonstrate this code, I created a new <a href="https://dasch.cfa.harvard.edu/drnext/#tutorials">tutorial notebook</a> for the example eclipsing binary <a href="http://simbad.u-strasbg.fr/simbad/sim-id?protocol=html&amp;Ident=hd%205501&amp;NbIdent=1&amp;Radius=2&amp;Radius.unit=arcmin">HD 5501</a>. For whatever reason, the “official” DASCH catalog match for this source only has a few dozen detections; the vast majority of the detections are associated with a source about 10 arcsec away. The notebook demonstrates how to identify this issue, how to choose which lightcurves to merge, and how to actually do the merge. In this case, the number of usable lightcurve points goes from about 2,000 (associated with a single refcat source) to about 2,500 (merged from ten separate lightcurves).</p> <p>Of course, this is one of those situations where software support is nice, but the real thing to do is to <strong>actually fix the problem</strong>. I think that the pipeline’s matching code probably ignores a-priori brightness information inappropriately. If I have a measurement of 10.0 mag, I should match it with a refcat source logged with a mean brightness of 10.1 mag, even if the coordinates of the detection are nominally slightly closer to something with a brightness of 15.0 mag. 
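</p>
<p>As a toy illustration of what I mean, the match score ought to penalize both the positional offset <em>and</em> the disagreement with the catalog magnitude. This sketch (in Rust) is emphatically not the pipeline’s actual code, and the weights are invented:</p>
<pre><code>/// Toy match score combining angular offset and magnitude agreement;
/// smaller is better. The 1-arcsec and 0.5-mag scales are invented for
/// illustration and are not taken from the DASCH pipeline.
fn match_score(det_ra: f64, det_dec: f64, det_mag: f64,
               cat_ra: f64, cat_dec: f64, cat_mean_mag: f64) -> f64 {
    // Small-angle approximation to the angular separation, in arcseconds.
    let dra = (det_ra - cat_ra) * det_dec.to_radians().cos() * 3600.0;
    let ddec = (det_dec - cat_dec) * 3600.0;
    let sep_arcsec = (dra * dra + ddec * ddec).sqrt();

    // With these weights, a 10.0 mag detection prefers a 10.1 mag catalog
    // star over a 15.0 mag one, even if the latter is a bit closer on the sky.
    (sep_arcsec / 1.0).powi(2) + ((det_mag - cat_mean_mag) / 0.5).powi(2)
}
</code></pre>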
<p>And, of course, <a href="https://newton.cx/~peter/2024/dasch-astrometry/">improving the astrometric calibration</a> will help. But I don’t want to get deep into mucking around with the photometry / lightcurve pipeline right now, so it seemed worthwhile to try to provide a mitigation in the meantime.</p> <p>To support the merge functionality, I made an update to <em>daschlab</em> that I’ve been avoiding for a little while. One of the key tables in a <em>daschlab</em> analysis session used to be the “plate list”: a table of information about every plate overlapping the coordinates of interest. But this was actually the wrong table to offer. Why? Because some plates record multiple exposures, and each exposure has its own sky coordinates. Sometimes, these exposures were of totally different parts of the sky: an equatorial survey field and a polar calibration field, for instance. So we should really have a table of <em>exposures</em> that overlap your target. There might be multiple exposures on a single plate, each of which covers your target but maps to a different portion of the plate’s digital image. I realized this a while ago, but put off dealing with it because it would be a pain to make all of the necessary changes. To do the merges properly, though, I needed to get this right, so I’ve torn off the band-aid: the <em>daschlab</em> plate table has become the <a href="https://daschlab.readthedocs.io/en/latest/api/daschlab.exposures.Exposures.html"><code>Exposures</code></a> table, and lots of associated aspects of the API have been reworked. This is a somewhat annoyingly invasive change, but these early days are absolutely the time to get things right. My apologies if this has messed up anyone’s existing notebooks.</p> <p>One additional wrinkle is that exposures come in two kinds. From the written logbooks that were created contemporaneously with the plates, we know what exposures should exist in principle: plate <a href="https://starglass.cfa.harvard.edu/plate/c09375">C09375</a> is logged to have two exposures. But we can also identify exposures from astrometric analysis of plate images: the DASCH pipeline uses the <a href="http://astrometry.net/">Astrometry.Net</a> software to find an astrometric solution, analyzes it, then searches for another solution hidden in all of the sources that don’t match the refcat using the first one. Do these two kinds of analyses always agree? Hah — of course not.</p> <p>The DASCH data model therefore counts both “exposures,” described in the historical logbooks, and “WCS solutions,” obtained from analysis of plate images. When possible, WCS solutions are matched to known exposures. But sometimes you have more exposures than solutions, and sometimes you have more solutions than exposures. If a plate has been scanned, some exposures might be mappable to the scan image (if it has an associated WCS solution), and some might not.</p> <p>While exposures lacking WCS solutions obviously don’t have high-quality astrometric information, we do have basic information about their rough sky positioning, so we’re still interested in them. The <em>daschlab</em> exposures list therefore includes all available information. This means that some exposures in the list can be mapped to a digital image, and others can’t — and the latter situation may occur even if the plate in question has been scanned.</p> <p>Correcting this situation led me to understand an important limitation of the DASCH mosaic-cutout software. 
In both <a href="https://dasch.cfa.harvard.edu/data-access/#cannon-data-portal">the “Cannon” legacy DASCH data access portal</a> and <em>daschlab</em>, the cutout-generating software doesn’t know about multiple exposures. If you request a plate cutout at a certain position, you’ll get a cutout using the first derived WCS solution (which may or may not be the first logged exposure), even if perhaps those coordinates only land on the plate using one of the other WCS solutions. Due to the way that the legacy pipeline handled multiple-exposure plates, I can’t easily fix this issue immediately. But, in the aftermath of <a href="https://newton.cx/~peter/2024/dasch-astrometry/">my reprocessing of the DASCH astrometry</a>, the associated data products are going to be significantly improved, and I’ll address this problem.</p> What TeX Gets Right Wed, 17 Jul 2024 08:22:08 -0400 https://newton.cx/~peter/2024/what-tex-gets-right/ https://newton.cx/~peter/2024/what-tex-gets-right/ <p>Last week <a href="https://newton.cx/~peter/2024/the-new-latex/">I wrote about how LaTeX might evolve to stay relevant</a> to 21st-century technical writing. But technology has come a long way since the 1970’s. Should we even be encouraging people to create documents using the venerable TeX language, which was designed at a time when computers — and computing — were so different than they are today? This week I want to write a bit about why it’s worth the effort to build on TeX/LaTeX, instead of starting fresh.</p> <span id="continue-reading"></span> <p>(This post is strongly derived from <a href="https://github.com/tectonic-typesetting/tectonopedia/blob/main/txt/explain/why-tex.tex">“Why TeX?”</a>, an explainer I’ve written as part of my prototyping of the <a href="https://github.com/tectonic-typesetting/tectonopedia">Tectonopedia</a>.)</p> <p>First things first: I'll happily admit that there are plenty of circumstances where TeX is not the best solution, and you'll be better off using some other kind of technology — whether that's <a href="https://en.wikipedia.org/wiki/Markdown">Markdown</a>, <a href="https://www.microsoft.com/en-us/microsoft-365/word">Microsoft Word</a>, pen and paper, or whatever else. The very notion of “creating documents” is so broad that it should go without saying that no single system is going to be the best choice for every situation.</p> <p>That being said, my work on Tectonic is inspired by the belief that despite its age TeX is still the very best tool in the world for solving certain kinds of problems. For instance, if you know one thing about TeX, it's probably that it's good for mathematics. And that reputation is well-earned! A proficient TeX user can easily write a single line of code to conjure up complex typeset equations.</p> <p>But what should actually impress you more than long, complex equations are equations typeset inline with body text, like <em>y = x²</em>. Readability requires that the placement, sizing, and appearance of the math and text symbols all agree well, issues that can be fudged a bit in “display” equations. TeX is one of the few tools out there that can get all these things just right. (This blog is <em>not</em> typeset using TeX, but the above equation is also quite simple; I’d say that it looks OK but not great in my browser.)</p> <p>But we wouldn't be writing all this verbiage in honor of a finicky math layout algorithm. Why can't other tools just copy TeX’s algorithms and do math equally well? 
My claim is that the real challenge of typesetting mathematics is that written math is an open-ended, generative visual language, admitting infinitely varied forms in unique, unpredictable, recursive combinations. TeX can handle math well not because it got some specific fiddly bits right — although it did — but because <em>TeX is itself an expressive, open-ended language</em>. You need an open-ended language to be able to reproduce the open-endedness of written math. Other tools give you building blocks; only TeX gives you the machinery to create and use new blocks of your own. (This is not an equivalent statement, but for what it’s worth, the TeX language is <a href="https://en.wikipedia.org/wiki/Turing_completeness">Turing-complete</a>.) And your own “blocks” can be just as easy and natural to use as the built-in ones. This is far from the only reason that TeX is good at what it does, but it's the most essential one.</p> <p>The power of TeX manifests itself not only in low-level ways, such as math typesetting, but in higher-level ones as well. For instance, scholarly documents have detailed conventions for handling bibliographic references. Although neither the core TeX language nor Microsoft Word has any built-in, structured way to represent reference metadata, TeX has been extended to support it in the <a href="https://en.wikipedia.org/wiki/BibTeX">BibTeX</a> framework. BibTeX's new commands don't feel like awkward extensions: they integrate straightforwardly with the rest of the language and are intuitive for users. While it's true that you can manually typeset your references in Word and assemble them into a bibliography, it's fair to say BibTeX provides a fundamentally more powerful way to work with them. I’ll wager that most people who have gotten the hang of BibTeX would <em>hate</em> the idea of giving it up and going back to managing their references manually.</p> <p>Now, if you’re a healthy skeptic of abstraction, you'll likely respond: “whoa, I don't want some elaborate system that can do anything — I just want a tool that helps me get the job done”. This is the right response! Human lifetimes have been wasted on the refinement of elegant but useless ideas, and we have deadlines to meet. But hopefully you'll agree that in some situations, a system of abstraction is exactly what you need to get the job done. Try doing physics without calculus.</p> <p>I'll assert, but can't possibly prove, that once you stop accepting the limitations that less powerful tools impose on you, you’ll start seeing opportunities to use TeX's capabilities everywhere. Many kinds of documents have a sort of “internal logic” that becomes easier to express given the right tools. That being said, the ones where TeX's capabilities generally add the most are ones where this internal logic is easy to find: documents with lots of cross-references, figures, tables, equations, or code — the things that I call technical documents. And it’s likely no coincidence that these are the sorts of documents that TeX has historically most often been used to create.</p> <p>It’s worth emphasizing that this “internal logic” of certain documents can be open-ended and generative in the same way that written math notation is. For instance, I often find in API documentation that certain software frameworks introduce new conceptual structures that don’t exist in the underlying implementation language. 
One example would be the notion of store <a href="https://pinia.vuejs.org/core-concepts/state.html">state</a> introduced in the <a href="https://pinia.vuejs.org/">Pinia</a> framework. You can write API documentation that discusses this state in terms of concrete JavaScript/TypeScript statements, but it’s really a new concept that deserves to be documented as a “first-class” object on equal footing with other pre-existing language concepts like classes and methods — I should be able to reference “the <code>isAdmin</code> state variable of the <code>Server</code> store” in a convenient and natural way. A documentation framework needs to give you the tools <em>to give yourself the tools</em> to do this.</p> <p>I’ll also boldly claim that despite its internal sophistication, TeX is easy to start using. I can't deny that TeX has a reputation for confusing output and sometimes inexplicable behavior, or that there are reasons that this reputation is deserved. Nevertheless, I'll point out that many mathematicians and scientists who do not care <em>at all</em> about its guts successfully use it for everyday work, even if it drives them up the wall at times. You can think of it as being a bit like <a href="https://git-scm.com/">git</a> in this way.</p> <p>Now, it’s certainly possible that one could develop a new, generative typesetting language that captures the virtues that I’ve discussed above <em>and</em> is free of TeX’s historical baggage. If you asked me to take on that task, I’d ask for … maybe a decade to do it? Designing a nicer syntax is one thing; building a whole new ecosystem is another. TeX may be old, but by the same token it is <em>battle-tested</em> and amazingly reliable — its parser can recover from the most pathological, hateful input documents that you can conceive of. While this kind of robustness often comes at a performance cost, TeX is <em>fast</em> and ingeniously efficient. It is <em>supported</em> by a worldwide community of users, who have gone to incredible lengths to modernize it and develop a dizzying array of extension packages and supporting software tools. A lot of very smart people have put a lot of effort into this language, which is still going strong after forty years — and those facts tell you something important.</p> <p>That is not to imply that today's TeX is perfect — <a href="https://newton.cx/~peter/2024/the-new-latex/">far from it</a>. The error messages are famously hard to understand. Its documentation is, ironically, a mess. Indeed, a major premise of <a href="https://tectonic-typesetting.github.io/">the Tectonic project</a> is that some aspects of the TeX ecosystem are in need of dramatic change. But not all of them. TeX <em>can be</em> the document language that the 21st century deserves.</p> LaTeX Can Be The New LaTeX Tue, 09 Jul 2024 11:29:08 -0400 https://newton.cx/~peter/2024/the-new-latex/ https://newton.cx/~peter/2024/the-new-latex/ <p>There’s a lot of interest in modernizing the tools for scientific and technical typesetting. Tools like <a href="https://mystmd.org/">MyST</a>, <a href="https://nota-lang.org/">Nota</a>, <a href="https://quarto.org/">Quarto</a>, <a href="https://idl.uw.edu/papers/living-papers">Living Papers</a>, <a href="https://show-your.work/">showyourwork!</a>, and many others are trying to make it easier — well, possible — to create technical documents that take advantage of the capabilities of today’s digital environment. Of these systems, different ones aim to work at different levels (low-level typesetting, vs. 
complete document authoring systems), but we can broadly think of them as tools aiming to become “the new LaTeX”.</p> <p>I’m not sure if I’ve made the argument in writing before, but I believe that “the new LaTeX” could, maybe even should, be founded on … LaTeX. Not LaTeX as it currently exists, but LaTeX as it could be.</p> <span id="continue-reading"></span> <p>There are two parts to this argument. The first is that TeX/LaTeX (hereafter just TeX, to emphasize the underlying language) gets some things right that are worth preserving; the second is that its problems are fixable. This post is going to skip the first part, and simply observe that here we are in the year 2024 still using TeX for precision technical typesetting, despite its age and all of its problems. It is probably the oldest piece of end-user software that’s still in common usage — TeX is nearly as old as Unix! It’s worth pondering why TeX is still relevant, but clearly it’s doing <em>something</em> right. A huge amount of human effort and ingenuity has gone into designing the TeX ecosystem; if we can build on that instead of throwing it all away, that’s a huge win for everyone.</p> <p>That being said, TeX as it currently exists absolutely has a ton of issues. It’s annoying to install. The error messages are inscrutable. The syntax is full of booby traps. It only truly targets PDF output. There are a dozen different ways to do the same thing and you need to be an expert to understand which is appropriate for your situation. It can feel impossible to dig up useful documentation on even the most basic commands. And so on, and so on.</p> <p>The ultimate causes of virtually all of these problems, I’d claim, are that TeX is simply very, very old, and that from the start the culture of TeX has been obsessed with stability. Neither of these is intrinsically a bad thing, of course! But there are also advantages to being less laden with historical baggage.</p> <p>For instance, a major issue with advancing the (La)TeX ecosystem is simply that the core TeX code has traditionally been nearly impossible to hack on. I often mention to people that when I first got interested in trying to modify the core TeX engine, it took me, an expert in this kind of task, something like several weeks to even figure out where the core source code actually lived, and how to compile it myself. That’s, like, the fundamental step involved in getting someone to contribute to your open-source project. If people can’t easily fork your project and try out changes, then no, you’re not going to streamline your installation process, or gradually evolve quality-of-life improvements.</p> <p>The other huge issue that I see is that TeX’s documentation is, to be blunt, <strong>awful</strong>. People sometimes seem to have trouble homing in on this — I mean, isn’t LaTeX literally a document preparation system? Haven’t thousands and thousands of pages of text been written about every <a href="https://ctan.org/">CTAN</a> package under the sun? And yet. Despite all of that effort, I find that existing TeX documentation generally fails completely at serving my day-to-day needs. I routinely struggle to pull up convenient, clear reference material about how a certain command works, or the design of certain fundamental concepts of the TeX language. The irony is staggering. But we can see how things got to where they are.</p> <p>First, there’s absolutely a first-mover penalty at play here. When TeX was invented, there was no World Wide Web.
Software documentation came printed on paper. So that’s how the whole ecosystem evolved: targeting print formats. The inexorable outcome of that is that these days, all of the ecosystem documentation is locked up in PDFs. They're often very <em>nice</em> PDFs, but that doesn’t do anything to help searchability, cross-linkage, and quick access. Can you think of any software system created in the past, say, decade whose documentation is <em>primarily</em> delivered in PDF format?</p> <p>Second, the TeX world’s obsession with stability led to fragmentation: if you wanted to add a new feature, you had to fork the mainline TeX engine and call it <a href="https://www.ctan.org/pkg/etex">ε-tex</a>, or <a href="https://ctan.org/pkg/ptex">ptex</a>, or <a href="https://www.ctan.org/pkg/uptex">uptex</a>, or <a href="https://www.ctan.org/pkg/pdftex">pdftex</a>, or <a href="https://www.ctan.org/pkg/xetex">xetex</a>, or <a href="https://www.ctan.org/pkg/luatex">luatex</a>, or whatever else. One of the <em>many</em> unfortunate consequences of this is that the documentation has become both fragmented and highly conditional. The information you need might be out there, but associated with a different engine; or a piece of information might be engine-specific without being labeled as such; or a really thorough piece of documentation might address a bunch of engine options but become nigh-unreadable by virtue of being mired in choices like “if you’re using XeTeX, do <em>this</em>; if you’re using LuaTeX, do <em>that</em>”. Documentation authors can’t assume that certain features are available because, hey, maybe you’re still using Knuth’s original TeX with only 256 registers.</p> <p>The longevity of TeX further complicates things. If you want to look up information about font selection, you might get material that was written before TrueType was invented. This is a good problem to have, in a certain sense, but it makes things more difficult for readers. This is especially true when combined with the ecosystem’s fragmentation and the often baked-in assumption that key elements will never change, so that there’s no need to plan for providing multiple versions of documentation.</p> <p>A particular culprit is <a href="https://search.worldcat.org/title/876762638">The TeXbook</a>. The TeXbook is, undoubtedly, an enormous accomplishment and a stunningly <em>whole</em> work, and it remains the definitive reference for the design and behavior of the innermost aspects of the TeX engine. But it’s also pretty bad documentation. If nothing else, it took me a while to appreciate that many things written in The TeXbook are <strong>simply not true</strong> of modern TeX systems, and haven’t been true for decades; now-fundamental features added in ε-TeX (1996) just don’t exist, as far as The TeXbook is concerned, let alone ones added in more modern extensions like XeTeX.</p> <p>The TeXbook has a further problem that I see mirrored in other major pieces of documentation in the TeX ecosystem. To borrow the terminology of <a href="https://newton.cx/~peter/2023/divio-documentation-system/">Diátaxis</a>, it’s trying to be two kinds of documentation at once: not just a detailed reference of the precise behavior of the system, but also a comprehensive explanation of the system design. These are both really wonderful kinds of documentation to have, but interleaving the two types of material makes the book hard to work with.
Descriptions of behavior are scattered around the book based on the overall pedagogical flow; the explanations are repeatedly derailed by precise definitions. It may feel churlish to criticize the book for being <em>too comprehensive</em> but it is a legitimate usability problem if people can’t find the information they need, when they need it!</p> <p>There’s also the basic matter of availability. It is possible to find PDFs of The TeXbook online, but they’re not supposed to exist. Reading between the figurative lines, I’m pretty sure that Knuth had a specific business plan: he’d give the software away for free, but make some money by selling the documentation. Nothing wrong with that choice at the time, but once again: notice that this is not how anyone does things anymore. When you actually have to compete for mindshare, high-quality documentation that’s freely available is one of the most valuable assets you can have.</p> <p>All of this is to say: while there may be a large quantity of documentation about the TeX ecosystem, I find that sadly it’s quite difficult to actually profit from. This is especially unfortunate since TeX is indeed a very complex system that, largely due to its age, does some things in ways that are quite foreign to modern users.</p> <p>The flip side of this is: imagine if TeX’s documentation was as sophisticated and cohesive as the underlying software. <em>It would change everything.</em> I sincerely believe that true best-in-class documentation would completely transform how people feel about using TeX. Instead of being some mysterious anachronism, full of ancient magic, it would be the quintessential power tool, sometimes challenging to master but ready to solve the hardest problems you can throw at it. It would go from dinosaur to thoroughbred.</p> <p>The <a href="https://tectonic-typesetting.github.io/">Tectonic</a> project is, of course, motivated by this vision. At the basic technical level, much of the Tectonic approach is simply about making the TeX engine hackable. As elaborated in <a href="https://doi.org/10.47397/tb/43-2/tb134williams-tectonic">the Tectonic TUGBoat article</a>, at the higher level, the project’s distinctive branding gives us the room to break with the past when appropriate. I can’t deny that launching a new brand only worsens the fragmentation issue, but I think it’s a necessary step to unstick everything else that needs improvement. Thus far the Tectonic project has not provided much at all in the way of new, high-quality documentation, but that’s something of an intentional choice — I’m dissatisfied with the currently available tools and am trying to sketch out a better system in the <a href="https://github.com/tectonic-typesetting/tectonopedia">tectonopedia</a> project. Part of that effort is building a wholly-new set of engine features and LaTeX classes to generate “native” HTML output — that is, targeting original TeX source documents aimed solely at producing optimal HTML outputs, instead of trying to generate good HTML from existing documents.</p> <p>My hope is that Tectonic can help TeX to become a sort of assembly language for complex documents. In plenty of cases, Markdown will be all you need. 
But if you want to create a sophisticated <a href="https://newton.cx/~peter/2024/digital-docs-are-web-apps/">technical-document-as-web-application</a>, it will probably be one of the final tools that runs before your web bundler, generating precision-typeset prose components and cross-referencing databases, and linking them together with all of the other elements of your document: interactive figures, navigational chrome, and the rest.</p> <p>You might ask: “why does this assembly language need to be TeX?” That gets to the question of what TeX gets right — a topic for <a href="https://newton.cx/~peter/2024/what-tex-gets-right/">another post</a>.</p> Reprocessing DASCH’s Astrometry Wed, 26 Jun 2024 09:30:00 -0400 https://newton.cx/~peter/2024/dasch-astrometry/ https://newton.cx/~peter/2024/dasch-astrometry/ <p>Yesterday I completed a large effort to reprocess all of <a href="https://dasch.cfa.harvard.edu/">DASCH’s</a> astrometry. We now have astrometric solutions for 415,848 plates, and with this groundwork laid I plan to start working on reprocessing the full DASCH photometric database.</p> <span id="continue-reading"></span> <p>There were a couple of motivations for this reprocessing effort. The DASCH astrometric calibrations have been a bit of a mess, because the astrometric solutions were stored as WCS headers attached directly to the “mosaic” FITS files. This is absolutely the most obvious way to store this information, but I’d argue that it’s actually not a great approach from a data-management standpoint. In this scheme, if you <em>re</em>-calibrate a mosaic, you need to rewrite the entire FITS file, opening up risks of data loss and confusion due to changing file contents. There’s also no clear way to, say, try a <em>new</em> calibration scheme and compare the results. You could see evidence of the inflexibility of this approach: the DASCH filesystem indicates a mosaic’s level of astrometric calibration in its filename (raw, “WW” for basic localization, “TNX” for distortion correction), and there were lots of mosaics where the filename was inconsistent with the actual FITS headers. There were also mosaics with <em>very</em> old calibrations using IRAF headers that I’d never even seen before, as well as a few seemingly lingering from very early work (e.g., some files with a “W” tag in their name).</p> <p>There were also a lot of files that looked like they <em>should</em> be easily solvable that were missing calibrations, in addition to ones with solutions that were bogus. The latter situation is documented on the DASCH website as the <a href="https://dasch.cfa.harvard.edu/drnext/ki/incorrect-astrometry/">“Incorrect Astrometry” known issue</a>, and it’s the one that bothered me the most. One of the great things about the <a href="https://astrometry.net/">Astrometry.Net</a> software, which is the basis of DASCH’s astrometric solutions, is that it should virtually never produce false positives: if it hands you a solution, that solution is pretty much guaranteed to be pretty good. So, if the DASCH solutions include bogus ones, that means that the pipeline is starting with a decent Astrometry.Net solution and breaking it.</p> <p>I dug into this issue and discovered that it seemed like the primary culprit was the stage of the DASCH astrometric pipeline that comes after Astrometry.Net.
Once we get the initial solutions, we feed them into the <a href="http://tdc-www.harvard.edu/wcstools/">WCSTools</a> program <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a>, which attempts to refine the solutions. Unfortunately, <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> uses a numerical optimizer but doesn’t have any conception of a global goodness-of-fit, which means that if the optimizer somehow converges on an incorrect solution, it’s very hard to step back and say “no, I don’t like where we ended up”. <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> is also an extremely old tool written in gnarly C, so it’s challenging to improve the code.</p> <p>I did end up creating a DASCH-specific fork of <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> and modifying a few behaviors. For instance, <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> optimizes to match the list of sources in your image against a set of catalog sources, which is constructed from a list of the brightest sources in an all-sky catalog in an RA/Dec box around your image’s initial position. This is fine most of the time, but in DASCH where some images are <em>seventy degrees tall</em>, a box in RA/Dec can be vastly bigger than the actual image area, so that downselecting to the brightest sources means that only a handful of reference sources are actually on your image. So, I added a step to limit the catalog results to ones that actually overlap the image’s initial position. (If your image is going to shift over a bit, this might mean that you miss out on a handful of sources that might be useful, but since we’re starting with an Astrometry.Net solution, we know that we’re starting very close to our destination, so the vast majority of useful sources will be covered by our initial guess of the image footprint.)</p>
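<p>To make that concrete, here’s a rough Python rendering of the catalog-downselection logic. The actual change lives in the DASCH fork of <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a>, which is C, and the function and variable names below are invented for illustration; the point is simply to project the catalog through the initial Astrometry.Net solution and keep only the sources that land on (or near) the plate <em>before</em> picking the brightest ones.</p> <pre><code>import numpy as np
from astropy.wcs import WCS

def select_reference_sources(initial_wcs: WCS, nx: int, ny: int,
                             cat_ra_deg, cat_dec_deg, cat_mag,
                             n_brightest=500, margin_pix=50.0):
    """Keep catalog sources that land on the footprint implied by the
    initial Astrometry.Net solution, then take the brightest of those."""
    # Project the catalog onto the plate using the initial solution.
    x, y = initial_wcs.all_world2pix(cat_ra_deg, cat_dec_deg, 0)

    # Accept sources within the image bounds, plus a small margin in case
    # the refined solution shifts things over a bit.
    on_image = (
        (x &gt; -margin_pix) &amp; (x &lt; nx + margin_pix) &amp;
        (y &gt; -margin_pix) &amp; (y &lt; ny + margin_pix)
    )

    # Only *then* downselect to the brightest sources, so that everything
    # we keep can actually participate in the fit.
    idx = np.flatnonzero(on_image)
    order = np.argsort(np.asarray(cat_mag)[idx])
    return idx[order][:n_brightest]
</code></pre> <p>The essential design choice is the ordering of the cuts: filter by position first, and only then by brightness.</p>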
<p>Another significant improvement wasn’t within <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> itself, but related to the list of image sources fed into it. This list was derived from a <a href="https://www.astromatic.net/software/sextractor/">SExtractor</a> table in a pretty naive way, once again just selecting the brightest sources. This scheme could fail badly for plates with really inhomogeneous backgrounds, cracks, and other defects. Once again, you would end up with a source list that only included a handful of actual stars from the image, which is a great way to get the optimizer to drive your solution somewhere unfortunate.</p> <p>There were a lot of other small fixes as well, trying to improve the robustness of the pipeline. In the end, the success rate went from 94% to 97% — or, put another way, I was able to eliminate fully half of the solution failures. What I haven’t yet been able to look into is the number of incorrect solutions coming out of the new codebase. I think that it should be a lot lower thanks to the kinds of improvements I mentioned above, but it’s surprisingly hard to check into this quantitatively. For any given plate, it’s easy to identify a way-off solution by eye, but automating such an analysis to cover the full diversity of the DASCH collection — some plates contain 50 stars, some contain 200,000 — is a lot harder.</p> <p>There are also borderline cases, often relating to plates with multiple exposures. A plate with multiple exposures doesn’t have “an” astrometric solution — it has a set of several solutions. Any given location on the plate has several different, valid RA/Dec coordinates, and a single RA/Dec position may appear at multiple pixel positions on the plate! As you might expect, this can get pretty gnarly to deal with. The DASCH pipeline works by starting with an initial list of all of the sources extracted from a plate image, and then peeling away the sources that match the catalog after the best astrometric calibration is arrived at. The remaining unmatched sources are then fed into Astrometry.Net again, and you iterate until Astrometry.Net stops finding solutions. (So, this process depends highly on Astrometry.Net’s lack of false positives!)</p> <p>These plates can fail in a few hard-to-handle ways, though. Particularly tricky are the ones containing close multiple exposures, like <a href="https://starglass.cfa.harvard.edu/plate/a21562">A21562</a>. If you have two exposures very close to one another, the <a href="http://tdc-www.harvard.edu/wcstools/imwcs/">imwcs</a> optimizer is vulnerable to what I call the “split-the-difference” effect, where it converges to a solution that lands right between each source pair. In a least-squares sense this is indeed the optimal solution, but it’s incorrect. You can also get an effect where if, say, the source pairs are oriented in the left-right direction, the solution has a global skew where it’s bang-on for the left sources in the top-left of the plate, but bang-on for the <em>right</em> sources in the bottom-right of the plate. This one is even trickier to detect since you can have really high-quality matches across large areas of the plate — and since many plates have large-area defects, you can’t expect a plate to have high-quality solutions <em>everywhere</em>. All of these problems get even hairier for plates with <em>many</em> exposures, like <a href="https://starglass.cfa.harvard.edu/plate/c21253">C21253</a>, which appears to contain 11 exposures in a tight spatial sequence.</p> <p>As best I can see, these plates will need to be handled by some preprocessing. You could make a histogram of pixel separations for every pair of sources in an image, and then decompose the peaks to infer how many close-multiple exposures there are. (Not all multiple exposures are close: some plates have, say, one exposure on the celestial equator, and one at the pole.) For instance, if a plate has four exposures, you might find up to six peaks in the source-pair separation histogram: one for each pair of exposures. (But you might not find that many if peaks coincide; e.g., for a sequence of exposures with a roughly regular spacing on the plate.) You could potentially increase the power of the search by adding a delta-instrumental-magnitude axis to the histogramming process, since if you have a pair of exposures where one has half the duration of the other, the repeated source images should all appear fainter by a constant factor, modulo the nonlinearity of the photographic medium. All of this analysis could be done before any astrometric or photometric calibration, and then you could use it to filter down the source lists that you feed into Astrometry.Net and imwcs, hopefully preventing issues like split-the-difference. I’ve looked into all of this in a preliminary way, but unfortunately I don’t know if I’ll get the chance to really pursue this idea.</p>
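<p>Just to make that first step a bit more concrete, here is a minimal Python sketch of the pairwise-separation histogram, using SciPy and hypothetical inputs: pixel coordinates and rough instrumental magnitudes from the source extraction. To be clear, this is not DASCH pipeline code, just an illustration of the idea.</p> <pre><code>import numpy as np
from scipy.spatial import cKDTree

def separation_histogram(x, y, inst_mag, max_sep_pix=200.0, bin_pix=2.0):
    """Histogram the separations of all source pairs closer than max_sep_pix.

    Peaks suggest close multiple exposures: each pair of exposures piles up
    source pairs at (roughly) one characteristic separation on the plate.
    """
    xy = np.column_stack([np.asarray(x, float), np.asarray(y, float)])
    mag = np.asarray(inst_mag, float)

    # Find every pair of sources closer than the search radius. Keep the
    # radius modest for crowded plates, since the number of pairs grows fast.
    pairs = np.array(sorted(cKDTree(xy).query_pairs(max_sep_pix)))
    if len(pairs) == 0:
        return None, None

    dxy = xy[pairs[:, 0]] - xy[pairs[:, 1]]
    sep = np.hypot(dxy[:, 0], dxy[:, 1])
    dmag = np.abs(mag[pairs[:, 0]] - mag[pairs[:, 1]])

    # 1D histogram of separations ...
    sep_bins = np.arange(0.0, max_sep_pix + bin_pix, bin_pix)
    hist1d, _ = np.histogram(sep, bins=sep_bins)

    # ... and a 2D version with a delta-magnitude axis, which should sharpen
    # the peaks when one exposure is systematically shorter than another.
    dmag_bins = np.arange(0.0, 5.0, 0.25)
    hist2d, _, _ = np.histogram2d(sep, dmag, bins=[sep_bins, dmag_bins])
    return hist1d, hist2d
</code></pre> <p>Decomposing the peaks of those histograms into a set of exposure offsets would then feed the source-list filtering described above.</p> <p>Anyway. Due to issues like the above, there might be more astrometric reprocessing in my future, but hopefully we’ve improved the baseline significantly.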
The updated astrometric database includes 415,848 solutions, and everything is based on <a href="https://www.cosmos.esa.int/web/gaia/dr2">Gaia DR2</a> via the <a href="https://archive.stsci.edu/hlsp/atlas-refcat2">ATLAS-REFCAT2</a> catalog. This project also got me to really straighten out my understanding of how to handle multiple-exposure plates, as well as some of the technical infrastructure surrounding them. They represent only a fraction of the corpus, but I think it’s important to deal with them properly.</p> <p>The processing took a span of around 46 days all told, starting with a list of 428,416 mosaics. Over the course of that time I did a lot of work to speed the running of the pipeline, so that if I had to restart everything now, I think that I should be able to finish in 17 days or fewer. Fun fact: I only made one improvement to optimize the actual DASCH pipeline code, to fix some catastrophic failures where a certain step would sometimes take ~forever. Literally everything else was about making more optimal use of the <a href="https://www.rc.fas.harvard.edu/about/cluster-architecture/">Cannon</a> cluster resources, which was enough to speed up my processing by a factor of <em>five</em> or more. It pays to understand your platform, folks! The biggest win there was making a change to avoid the <a href="https://slurm.schedmd.com/documentation.html">Slurm</a> scheduler as much as possible and use my own job-dispatching system instead. It’s a little disappointing, in a certain sense, to be able to beat the scheduler at its own game so badly, but on the other hand Slurm is definitely optimized for highly-parallel simulations, not lots of little interdependent data-processing jobs.</p> <p>As a final point, these new astrometric solutions are <em>not</em> yet available in <a href="https://dasch.cfa.harvard.edu/data-access/">the DASCH data access services</a>. As I mentioned at the outset, the legacy DASCH data organization just serves up whatever WCS headers are attached to each FITS image. The reprocessed approach stores the astrometric data separately, as small data packages that I’m calling “results”. Eventually, these results will be made available as their own data products, and the various data access services will combine the mosaic imagery and the astrometric results on-the-fly as appropriate. But, I need to write and deploy the code to do all of that. In the meantime, if you’re interested in the new solutions, get in touch.</p> <p>Now that I have this new baseline of astrometry results, though, I can turn to the next step: reprocessing all of the photometry. Like the DASCH astrometric data, the DASCH photometric databases contain 18+ years of accumulated cruft, and I’m very eager to clean them out! I’ll also use the “result” framework to do a better job of exposing the photometric calibration data, which I expect to be very valuable for people who want to dig into the DASCH lightcurves in detail. 
In particular, I should finally be able to surface the information that indicates which plates use which emulsion, which is a super important piece of information that we can’t actually pull out right now.</p> The Software Citation Station (Wagg & Broekgaarden 2024) Tue, 11 Jun 2024 11:13:33 -0400 https://newton.cx/~peter/2024/citation-station/ https://newton.cx/~peter/2024/citation-station/ <p>A new software citation project is hot off the presses — <a href="https://www.tomwagg.com/software-citation-station/">The Software Citation Station</a>, as described in Wagg &amp; Broekgaarden 2024 (<a href="https://arxiv.org/abs/2406.04405">arXiv:2406.04405</a>). I was starting to build toward a discussion of this topic in <a href="https://newton.cx/~peter/2024/indexing-discoverability/">my previous post</a> so this is great timing, as far as I’m concerned!</p> <span id="continue-reading"></span> <p>In <a href="https://newton.cx/~peter/2024/indexing-discoverability/">Software Indexing and Discoverability</a> I wrote about software indices (i.e., Big Lists of All The Software; a.k.a. registries, catalogs), which are strangely common in the research world, if you consider how few people in the general open-source community are interested in building such things. One oft-cited reason for indexing software is “discoverability”, a goal which I argued is actually (and perhaps surprisingly) very much in tension with the general desire for indices to be as complete as possible.</p> <p>The other, probably dominant, motivation for indexing software is to enhance citeability. Everyone knows that software work doesn’t get enough credit in academia; not only do researchers not cite the software that they <em>should</em> be citing, but even when they want to appropriately credit a certain piece of software, it’s often not clear <em>how</em> to do so. Of course, the latter trend tends to induce the former.</p> <p>One popular approach to fix this situation is to try to index software and ask people to <em>cite through the index</em>. For instance, this is explicitly what the <a href="https://ascl.net/">Astrophysics Source Code Library</a> (ASCL) is all about. Once a piece of software is indexed in the ASCL (e.g., <a href="https://github.com/pkgw/pwkit/">pwkit</a> becomes <a href="https://ascl.net/1704.001">ascl:1704.001</a>), the index entry is potentially citeable; for instance, ASCL has an agreement with <a href="https://ui.adsabs.harvard.edu/">ADS</a> such that <a href="https://ascl.net/1704.001">ascl:1704.001</a> becomes <a href="https://ui.adsabs.harvard.edu/abs/2017ascl.soft04001W">2017ascl.soft04001W</a> with associated <a href="https://ui.adsabs.harvard.edu/abs/2017ascl.soft04001W/exportcitation">BibTeX</a> you can use in your next paper. If <em>everyone</em> puts their software in the same index, then you have a nice uniform way to cite software.</p> <p>My blunt assessment, however, is that this approach is <strong>fundamentally broken</strong>. The fundamental issue is that there is only one entity that gets to determine how to cite some scholarly product: its publisher. When I publish an article in <a href="https://apj.aas.org/">ApJ</a>, it is <a href="https://journals.aas.org/">AAS Publishing</a> that ultimately determines how that article should be cited: “People can cite your article by referencing: ApJ, volume 123, e-id 456, DOI: 10.yadda/yadda, authors Williams, ...”.
A significant responsibility of the AAS Publishing organization is to ensure that such citations will remain useful into the indefinite future.</p> <p>The problem with cite-through-the-index is that the index is not the publisher. The index may record information <em>about</em> various entities but is not ultimately in control of those entities. This means that the index is going to get out of date with regard to the actual citeable objects in question, both in terms of keeping up with newly-published entities and changes to existing ones. This may sound like a fixable problem, but I have become more and more convinced that it is foundational, and dooms the whole enterprise. The problem is that maintaining a high-quality index of stuff that <em>other</em> people publish is enormously labor-intensive and therefore costly. If you’re an index, you need to provide something really valuable in order to be long-term sustainable. Meanwhile, publishers — people who make things citeable — are essentially fungible; think of how many different scholarly journals there are! So there will always be bargain-basement publishers; and in the field of software, publishing generally costs zero. The value provided by the publishing service is not enough to offset the costs of maintaining an index worth using.</p> <p>By contrast, consider ADS. It is also an index of published objects, but while astronomers will certainly trade around bibcodes like <a href="https://ui.adsabs.harvard.edu/abs/2022ApJ...935...16E">2022ApJ...935...16E</a> informally, when it comes time to make a formal citation, they “resolve” that bibcode to the <em>actual</em> citation, <a href="https://doi.org/10.3847/1538-4357/ac7ce8">Eftekhari et al., 2022 ApJ 935 16</a>. ADS indexes citeable items but is not construed as making them citeable itself.</p> <p>That may be so, but it doesn’t make it any cheaper to maintain ADS’s index. ADS is clearly doing something that the community finds incredibly valuable. If ADS isn’t making things citeable, what is it doing?</p> <p>In my view, the key is that ADS provides uniform search of the citation network <em>across publishers, in a regime where references between items from different publishers are both interesting and ubiquitous</em>. Articles cite each other across publisher boundaries willy-nilly, which makes it very valuable to be able to have a unified search interface that crosses publisher boundaries as well. If I don’t care about that cross-referencing network, or if items within one publisher only reference other items from the same publisher, the value of an ADS-like service is a lot less clear. (This is why it’s not clear to me that “ADS for datasets” is a viable concept: the network of links between datasets is, I think, a lot less interesting.) In general, a multi-publisher index is going to provide some kind of homogeneous view of its collection, and there has to be something about that homogenization that people find valuable <em>in and of itself</em>.</p> <p>This analysis also helps us see why software citation is in such sorry shape to begin with. <strong>Software is hard to cite because it is self-published.</strong> I don’t send my latest Python package to AAS for them to review and archive it; I just upload it to GitHub myself. This is great in a lot of ways, but it turns out that high-quality publishing is harder than it looks! In particular, archiving and preservation, the bedrock of citeability, are specialized tasks that really ought to be done by professionals.
Returning to the analogy with traditional articles: when I publish through ApJ, I’m not told to manually upload my article to Zenodo or to figure out what the appropriate reference should be; AAS does the necessary tasks on my behalf and then tells me the result. <em>This is how it should be</em>. Amateur-hour attempts to do these things are what get you, well, the current state of software citation. Instead of a few professionals doing things in a consistent way, you get a bunch of well-meaning people trying to figure things out one at a time.</p> <p>One path to improving the state of software citation, then, is to make it easier for people to do a decent job of publishing software. Tools like <a href="https://zenodo.org/">Zenodo</a> and my <a href="https://pkgw.github.io/cranko/book/latest/">Cranko</a> aim to help on this end, and it is easy to see how the value provided by ASCL was much greater before GitHub and its ilk emerged.</p> <p>But the other half is that we have an enormous corpus of self-published stuff out there that deserves citation, and we need to make it easier for people to do so. This finally brings us back to the <a href="https://www.tomwagg.com/software-citation-station/">Software Citation Station</a>. At first blush this might seem similar to something like <a href="https://ascl.net/">ASCL</a>, but it’s different in a very important way. SCS is not attempting to provide its own citations; instead, it is like ADS, indexing entities that were published elsewhere. This is, in my view, the correct approach.</p> <p>The SCS also, I believe, captures a key insight about the role of indices in the field of software citation. If we’re trying to maintain an index of software packages that are published in a variety of external locations, the key question we have to ask is: why is anyone actually going to use this thing? What are we providing that a Google search doesn’t?</p> <p>There’s a beautiful answer: standardized information about how things should be cited! The whole key is that there will <em>never</em> be a one-size-fits-all way to cite any piece of software that’s ever been posted online. This is why people write <a href="https://libguides.mit.edu/c.php?g=551454&amp;p=3900280">whole guidebooks</a> about <a href="https://www.software.ac.uk/publication/how-cite-and-describe-software">how to cite</a> <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7805487/">software</a>, while no one sees any need to document “how to cite an article published in a major journal”. Because software is self-published and each package is a unique snowflake, citation-wise, citation instructions are unclear and are scattered across the internet in READMEs, Zenodo pages, published articles, and elsewhere. This is exactly the kind of inhomogeneous chaos that a good index can simplify. The actual citations can’t be homogenized, but the information about <em>how to cite</em> can be. Probably the <em>only</em> commonality about all the different pieces of research software out in the world is that people want to know how to cite them!</p> <p>There’s one piece, however, that I think the SCS is missing. Maybe Tom and Floor are about to get a bunch of funding, but I suspect that the maintenance of this index will be challenging.
As I alluded to in <a href="https://newton.cx/~peter/2024/indexing-discoverability/">my kicker last week</a>, I can think of basically one “form factor” that is proven to at least be <em>able</em> to yield a reliable online knowledgebase without paid staff: a wiki.</p> <p>While the SCS has a form for people to submit new software, I expect that framing will discourage involvement. If SCS is missing an entry for, say, <a href="https://mosfit.readthedocs.io/">MOSFiT</a> (and it is), I probably won’t feel comfortable “submitting” a record for it unless I’m the primary author (and I’m not). But if we cast it as a wiki, then it opens the door up for me to do my best to create a record for the software, even if it’s not “mine”. Of course, maybe I’ll make a mistake and someone will have to come in and correct it, but that’s exactly how people expect to use wikis. If I’m an active maintainer of a package, I’ll want to come in and check that the SCS record is correct; but if a package becomes unmaintained, it is wholly appropriate, and perhaps necessary, for other people to keep the citation information up-to-date. For instance, maybe a project was published using some service that got shut down and absorbed into Zenodo; you don’t need to be the original author to assess whether the citation information should change to refer to the new service. While the desires of active maintainers should always take precedence, of course, there are plenty of cases where third parties are perfectly competent to maintain the citation information.</p> <p>So really what I think we need is a “Cite-o-pedia”: one page per citeable package, creatable or editable by anyone in the world. You’d certainly want some structure under the hood to record common elements like associated articles, Zenodo concept DOIs, and so on, but fundamentally the citation information might have to be free text, because of the wild diversity of practices that authors want people to follow.</p> <p>And with this perspective, nothing about this idea is specific to software. For instance, you could imagine a Cite-o-pedia entry for <a href="https://www.pas.rochester.edu/~emamajek/doc.html">Erik Mamajek’s star notes</a>; I know for a fact that people want to cite them even if Erik himself isn’t very concerned with that! The Cite-o-pedia concept is only specific to <em>self-published scholarly entities</em>, simply because things that come out of traditional publishers are homogeneous enough that it’s “obvious” how to cite them. </p> <p>That being said, one thing that a Cite-o-pedia could do that would dovetail well with a software focus would be helping people deal with versioning. If I’m citing a piece of software, I should ideally refer to two items: the overall software package, and the exact version that I’m using; the difference is formalized by Zenodo in the distinction between <a href="https://pkgw.github.io/cranko/book/latest/integrations/zenodo.html#orientation-software-dois">concept and version DOIs</a>. But most people are a little sloppy about this stuff. You could imagine Cite-o-pedia pages having decision trees that walk you through how to figure out which version you’re using and what, therefore, to cite.
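</p> <p>To make the “structure under the hood” idea concrete, here is a purely hypothetical Python sketch of what a Cite-o-pedia record might hold. Nothing like this exists today, and every field name below is invented; the point is just that structured identifiers, per-version entries, and free-text citation instructions would all live side by side.</p> <pre><code># Purely hypothetical sketch; these field names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class VersionEntry:
    version: str               # e.g. "2.1.0"
    version_doi: str = ""      # a Zenodo-style version DOI, if one exists
    notes: str = ""            # free text: "use this if you ran the 2021 pipeline"

@dataclass
class CiteRecord:
    name: str                         # human-readable package name
    homepage: str = ""
    concept_doi: str = ""             # the "project as a whole" identifier
    associated_articles: list = field(default_factory=list)  # bibcodes, DOIs, ...
    versions: list = field(default_factory=list)             # VersionEntry items
    how_to_cite: str = ""             # free-text instructions; the load-bearing field
</code></pre> <p>A wiki page per package could render a record like this however it liked; the free-text field is what absorbs the “unique snowflake” diversity, while the structured fields are what would support those version decision trees.</p> <p>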
Versioning is also not specific to software: you could imagine an entry for <a href="https://www.sdss.org/">SDSS</a> that helps you figure out which data release you’re using and the appropriate citation; see also AAS’s <a href="https://doi.org/10.3847/1538-4357/aae58a">living articles</a>.</p> <p>A couple more minor comments on the SCS article:</p> <ul> <li>The article states that if you cite a piece of software, you should cite everything in its dependency graph as well. It is <em>far</em> from obvious to me that this is true: if nothing else, as a matter of practicality, the dependency graph for a large application might include hundreds of packages. As an analogy, if I’m citing an article, I don’t cite every article that <em>it</em> cites! Choosing to cite something is ultimately just that — a choice, made by the author.</li> <li>The SCS is currently hosted on <a href="https://www.tomwagg.com/">tomwagg.com</a>; it will probably have to move to a non-personal domain to really start feeling like a community-wide resource. Fortunately, domains are pretty cheap these days.</li> </ul> <p>It’s exciting to see that people are thinking along these same lines, and the Software Citation Station feels very close to what I think the community needs. I’ve convinced myself that the “wiki” framing will be super important for success, but I’ll be very curious to see if that bears out in practice.</p> Software Indexing and Discoverability Wed, 05 Jun 2024 10:30:00 -0400 https://newton.cx/~peter/2024/indexing-discoverability/ https://newton.cx/~peter/2024/indexing-discoverability/ <p>Scientists have a thing about indexing their software. That is, a lot of them seem to think that it’s important for someone to maintain a big list of All of The Software for them to look at. (You could call this an “index,” or a “registry,” “catalog,” etc.; as far as I’m concerned these terms are all interchangeable.) This impulse is interesting to me because it’s basically absent in, say, the broader open-source software community. Why is that?</p> <span id="continue-reading"></span> <p>To pick a few concrete examples, in astronomy, we have the <a href="http://ascl.net/">Astrophysics Source Code Library</a> (ASCL); for exoplanets there’s NASA’s <a href="https://emac.gsfc.nasa.gov/">EMAC</a>; in heliophysics there’s the <a href="https://heliopython.org/projects/">PyHC Projects list</a>. NASA has a general <a href="https://software.nasa.gov/">software catalog</a>. There’s a generic <a href="https://research-software-directory.org/software">Research Software Directory</a>. It was even quite easy for me to find at least one <a href="https://github.com/NLeSC/awesome-research-software-registries">“Awesome Research Software Registries” list</a>, which is essentially an index of scientific software indices.</p> <p>Meanwhile, if I told someone in the open-source world that I wanted to make a list of all of the open-source software, I’m quite confident that I’d get two chief reactions: “why” and “also, that is obviously impossible”. To the extent that one can make any generalizations about large and diverse communities, the mindset is just very different. The prototypical scientist, if they think about this at all, is attracted to the possible upsides of having a good index without necessarily dwelling on the work needed to construct and maintain one.
The prototypical developer does the opposite.</p> <p>To cut to the chase, I’m generally on what you might call the “developer” side of things: there are a lot of attempts to index software that, frankly, I don’t feel are very valuable. But I don’t feel that way about all of them! To me, this signals that it’s important to try to think carefully about <em>when</em> and <em>why</em> these kinds of indices can be useful and successful. It’s an undertheorized topic, from what I’ve seen.</p> <p>A commonly-cited motivation for creating one of these indices is to enhance software “discoverability”. For example, the <a href="https://nspires.nasaprs.com/external/solicitations/summary.do?solId=%7b21419978-190B-811F-35A7-6D2DEEE24E4E%7d&amp;path=&amp;method=init">NASA HPOSS solicitation</a> specifically calls out “advancing access, curation, and discoverability of research software” as a priority of the NASA Open-Source Science Initiative. The rest of this post is going to focus on just this one aspect, but of course there are a variety of others at play — I’ll probably write more about them in the future.</p> <p>The discoverability angle has always puzzled me. They know that we have Google, right? Presumably there’s a feeling that general-purpose search doesn’t meet the needs of researchers, but I haven’t seen any arguments to that effect explicitly articulated.</p> <p>Signal-to-noise in search results is a possible factor, but I find that software packages are generally quite “Google-able”. It’s also true that domain-specific indices can include faceted search features not available in general-purpose engines, but I’m not at all persuaded that this is actually that important in this area. In my experience, faceted search is only really useful when you have <em>lots</em> of competing products within a well-established parameter space (shoes, hard drives). Meanwhile, research software is so bespoke that there’s hardly ever more than a handful of options that truly meet a particular need. In fact, research software is so bespoke that you’re pretty much always going to need to read the README (or equivalent) of any package you might want to use, anyway — no matter how detailed your domain-specific metadata are, your software index isn’t going to contain everything I need to make a decision. In other words, domain-specific indices are neither necessary nor sufficient for research software discovery. Google can index the README, and I’m going to need to read the README anyway.</p> <p>Furthermore, we already have a well-established domain-specific software discovery engine: the literature. If I’m doing work in a particular field and the papers keep on mentioning <a href="https://en.wikipedia.org/wiki/GADGET">GADGET</a>, well, now I know about it. And a general-purpose Google for <a href="https://www.google.com/search?hl=en&amp;q=gadget%20astrophysics">gadget astrophysics</a> yields pages upon pages of relevant resources.</p> <p>All that being said, people do discover software through means other than general-purpose text search. The Awesome Research Software Registries list mentioned above is an instance of a popular trend of making “<a href="https://github.com/topics/awesome-list">awesome lists</a>”, which generally present a big, er, list of resources in some topic that are judged to be, um, awesome. 
Many of them relate to software and technology, such as a list of <a href="https://github.com/awesome-selfhosted/awesome-selfhosted">software for self-hosting network services</a>, but the pattern now ranges from areas like <a href="https://github.com/DopplerHQ/awesome-interview-questions">interview questions</a> to <a href="https://github.com/sindresorhus/awesome-scifi">sci-fi novels</a>.</p> <p>These might at first seem quite similar to research software indices, but there’s an essential difference. It’s right there in the name: awesome lists are curated and selective, rather than exhaustive. This is even emphasized in the very first sentences of <a href="https://github.com/sindresorhus/awesome/blob/main/awesome.md">the awesome manifesto</a>: “If you want your list to be included on [the awesome list of awesome lists], try to only include actual awesome stuff in your list. After all, it's a curation, not a collection.” The model of a “registry”, on the other hand, generally abdicates the curatorial role: the whole idea is that random people can come along and slot their item into the collection wherever they deem appropriate. There isn’t a hard demarcation here — a hurried awesome list curator might accept submissions with minimal review; any competently-run registry is going to have some level of centralized editorial oversight — but the philosophical difference is important.</p> <p>(I see this issue with <a href="https://arxiv.org/">arxiv.org</a> periodically. Many people think of it as a “registry” type service and expect their submissions to flow through the system automatically; but under the hood, it is a moderated and to some extent curated system. People often do not react well when the influence of the moderators becomes apparent.)</p> <p>There’s a potentially interesting relationship between modes of discovery and expertise to be explored here. When you’re well-versed in a topic, general-purpose search supports discovery quite well: you know what terms to use, and you can quickly delve into results to identify which ones truly meet your needs. As a person with a lot of software experience, this is surely a big reason that research software indices don’t feel that useful to me.</p> <p>When you’re less expert, however, effective discovery requires more guidance. Curation — deference to the expertise of others — becomes more important. You’re more likely to spend time browsing a list of options, rather than narrowing it down as quickly as possible. From this standpoint, it’s easy to see why awesome lists are popular; I think it’s fair to say that they’re aimed at people who aren’t already experts in their fields. “Discoverability” turns out to be a relative term, depending on who exactly is trying to do the discovering.</p> <p>The problem with casting research software indices as discovery tools is that nearly all of the other motivations for their existence — e.g., enabling or promoting software citation — require them to be exhaustive, not curated. This puts them in a bind. If you’re not going to give people an opinionated list of recommendations, you need to be better than Google. It’s not hard to beat Google in terms of signal-to-noise of your results, but you also need near-perfect completeness — anything less feels like a catastrophic failure to the user.</p> <p>Unfortunately, achieving near-perfect completeness is, in general, awfully expensive. But it’s possible.
One route to this kind of completeness is to provide a service so valuable that software creators are <em>compelled</em> to use it. This happens with language package systems like <a href="https://www.npmjs.com/">NPM</a> or <a href="https://crates.io/">Crates.io</a>, etc., where the value of integration with the system is indisputable. And, indeed, when I need to search for JavaScript packages I’ll do it on <a href="https://www.npmjs.com/">NPM</a> rather than Google. I don’t use any fancy faceting, but the completeness is just as good, and the signal-to-noise is better.</p> <p>The other route is to continuously put in a ton of curatorial effort, like <a href="https://ui.adsabs.harvard.edu/">ADS</a>. This is also awfully expensive, and a non-starter for all but the best-funded of projects. Unless … you can convince people to donate their effort. We can call this the Wikipedia model.</p> Envisioning the Thingiverse Tue, 21 May 2024 10:54:19 -0400 https://newton.cx/~peter/2024/thingiverse/ https://newton.cx/~peter/2024/thingiverse/ <p>A <a href="https://www.aaronland.info/weblog/2024/04/26/matrix/#usf">recent essay and talk</a> by <a href="https://www.aaronstraupcope.com/">Aaron Straup Cope</a> touches on a <em>lot</em> of different interesting ideas, but one that particularly struck me was his experiment with putting things — in his case, a museum collection — on the Fediverse.</p> <span id="continue-reading"></span> <p>Cope currently works at the <a href="https://www.sfomuseum.org/">SFO Museum</a>, as in, the museum embedded in the San Francisco International Airport, and seems to have a history straddling the cultural heritage and technology sectors. This sounds both super cool and, probably, often very frustrating — my impression is that museums and similar institutions have technology challenges similar to, but even bigger than, those found in scientific education and public outreach. In fields like these, if you have a visionary bent, you can imagine <em>amazing</em> new things that are possible with current technologies; and then trying to implement the most modest project is slow, grinding, often disillusioning work. Even when people buy into a particular vision, the resources needed to execute it well are often far beyond what can be marshaled — a situation I know well from projects like <a href="https://worldwidetelescope.org/home">WorldWide Telescope</a> (WWT) and the efforts to use it in education like <a href="https://www.cosmicds.cfa.harvard.edu/">Cosmic Data Stories</a>. And there are, frankly, often people in positions of power who don’t seem to have a great deal of vision beyond the status quo.</p> <p><a href="https://www.aaronland.info/weblog/2024/04/26/matrix/#usf">Cope’s essay</a> is a very nice textualization of a talk recently delivered at <a href="https://myusf.usfca.edu/arts-sciences/art-architecture/art-history-museum-studies">USF</a>. It’s on the longer side, but well worth the read. (Side note: if you’re going to go to the effort of carefully preparing a talk, it seems well worth it to go the extra mile to write it up in this form — you’ll already have done the hard work of planning the argument and preparing visuals, and the result can propagate so much farther. Even if the talk is recorded, I find “too video; didn’t watch” to be a very real thing.) </p> <p>Not being a museum person, I don’t have anything substantial to say about several of his points, except that they ring true to me.
My colleague David Weigel of the <a href="https://www.rocketcenter.com/INTUITIVEPlanetarium">INTUITIVE Planetarium</a> at the <a href="https://www.rocketcenter.com/">US Space &amp; Rocket Center</a>, like Cope, spends a lot of effort on trying out ways that his experience can “follow you out of the building” — something that WWT is great for! — but I’m continually shocked-but-not-surprised at how few institutions seem to be tackling this challenge. I suspect that many individuals working at these institutions would love to do more, but just don’t have the resources they need to get anything off the ground. Sadly, this lack of resources seems to often turn into a sort of learned helplessness, rather than spurring creativity.</p> <p>Cope is really into the idea of museum visitors building up a durable, personal relationship with the items in collections. Over the years, he’s been involved in a few attempts to use technology to encourage this — I gather that the <a href="https://www.aaronland.info/weblog/2015/04/10/things/#mw2015">Cooper Hewitt Pen</a> is the best-known of these, although it’s not something that I’d heard of before. His most recent iteration is an idea that I love: creating a Fediverse/Mastodon account for every item in the SFO museum.</p> <p>One of the motivations is as follows: right now, if someone in a museum sees something that they want to remember, the most likely thing they'll do is take a picture of it on their phone. There’s nothing wrong with that, but I can imagine that as a museum curator it might feel like a hugely missed opportunity. It’s a one-time interaction, and structured information about the object in question is basically lost — in the sense that, say, if the object has a unique identifying number, there’s no reliable way for software to obtain it. Without that, even the most basic next steps that you can think of — e.g., “list all of the things that I liked at that museum” — aren’t possible.</p> <p>So: what if instead, visitors could “like and subscribe” to objects?</p> <p>Even if the object in question never posts anything, the “follow” action records the connection in a way that’s durable, bidirectional, and machine-actionable in the future. And you can immediately start thinking about things you can build based on the resulting social graph. (Which, of course, implies that privacy has to be considered carefully when building a system in practice.)</p> <p>Cope correctly points out that there are some significant practical hurdles to getting this foundational interaction to work. A small one is that there’s no easy way to convert a QR code scan into a follow, as far as I know; the much, much bigger one is that so few people are on the Fediverse. But I am extremely sympathetic to the argument that this kind of interaction <em>should</em> be on the Fediverse, and not a proprietary network, and there’s something about this vision that seems to me to be <em>so</em> sensible, in a certain way, that it actually makes me more optimistic about the Fediverse taking off in general. While I don’t see an ecosystem like the Fediverse ever having the social, viral grip of a commercial product like TikTok, I could see it gaining traction for what you might call “anti-viral” communication patterns: municipal updates about garbage collection, that kind of thing. 
And there’s at least the possibility that commercial experiences will bridge to the Fediverse, à la Threads, although such bridging has become a whole, contentious can of worms.</p> <p>To help motivate the focus on the Fediverse, Cope mentions that he asked someone at FourSquare if he could create 50,000 venues in bulk — one for each item in MoMA — and was, unsurprisingly, turned down. Likewise for creating 200,000 Twitter accounts. Beyond a desire to avoid vendor lock-in, which is a strong motivation in its own right, the ability to mass-create accounts is something that seems well-suited for a decentralized network. We see this with email, too. More generally, you often arrive at interesting places if you start with a service that has “user accounts” that were intended to be operated by a single human, and ask yourself: why might one person want to have 100 accounts? Why might one account want to be associated with 100 people? Or zero? You see the same patterns pop up with things like the <a href="https://www.spamresource.com/2023/08/private-relay-vs-private-relay-vs-hide.html">Apple Private Relay email service</a> or <a href="https://www.paypal.com/us/money-hub/article/virtual-credit-card-numbers">virtual credit card numbers</a>.</p> <p>Beyond the connection-building functionality — social following is the new bookmarking, you heard it here first — you then have the fact that your objects can toot. (Which, if you’re not familiar, is what we’re supposed to call the equivalent of tweeting on Mastodon. I guess “tweet” sounded pretty silly at first, too.) Depending on your default level of optimism, this is probably either very exciting or very scary. Surely you can spark joy and build amazing connections with the right kinds of interactions. On the other hand — we have enough trouble answering support questions in person, and now you want us to monitor 50,000 inboxes? Content moderation? It’s hard for me to see how opening up all of these touch points doesn’t become a huge new source of work. You can definitely make the argument that it’s good work: if you’re a museum, what better way can you be spending your time than interacting with patrons in a sustained way? But it’s work all the same.</p> <p>I’m not at all sure whether this particular idea will take off, but I love the audacity, and I like the range of new ideas that can build on it. What if every single <a href="https://platestacks.cfa.harvard.edu/">Harvard plate</a> had a Fediverse account? Why would people decide to “follow” a given plate? What would they say to it? What could we have the plates “say” spontaneously? If people start having conversations with plates, does that collected history start becoming a sort of per-item knowledgebase? I’m particularly intrigued by this last possibility. If you can somehow get people into a habit of messaging objects — @-ing is the new annotation? — it seems like a whole new mechanism for achieving the goals of projects like <a href="https://underlay.mit.edu/">The Underlay</a>.</p> <p>What might ensue if I created a Fediverse account for every post on this blog? What if every Wikipedia page was on the Fediverse, and posted about its updates? Every star in <a href="https://simbad.u-strasbg.fr/simbad/">SIMBAD</a>? Every <a href="https://somerville.mysticdrains.org/">drain in Somerville</a>? Maybe — probably — a lot of these ideas just wouldn’t work, but it would be fun to find out.</p>