scipy.stats Cheat Sheet

Key methods of the distribution classes in scipy.stats.

pdf
  • Probability density function
  • Probability of obtaining x < q < x+dx is pdf(x)dx
  • Derivative of CDF
  • Goes to 0 at ±∞ for anything not insane
  • Not invertible because it’s hump-shaped!
cdf
  • Cumulative distribution function
  • Probability of obtaining q < x is cdf(x)
  • Integral of PDF
  • CDF = 1 – SF
  • cdf(-∞) = 0 ; cdf(+∞) = 1
ppf
  • Percent-point function (inverse CDF)
  • If many samples are drawn, a fraction z will have values q < ppf(z).
  • PPF = inverse of CDF
  • Domain is zero to unity, inclusive; range indeterminate, possibly infinite.
sf
  • Survival function
  • Probability of obtaining q > x is sf(x)
  • SF = 1 – CDF
  • sf(-∞) = 1 ; sf(+∞) = 0
isf
  • Inverse survival function
  • If many samples are drawn, a fraction z will have values q > isf(z).
  • ISF = inverse of SF (duh)
  • Domain is zero to unity, inclusive; range indeterminate, possibly infinite.
logpdf
  • Log of PDF
logcdf
  • Log of CDF
logsf
  • Log of SF
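
These relationships are easy to spot-check. Here’s a quick sketch using a frozen normal distribution; any continuous distribution in scipy.stats behaves the same way, and the particular numbers are arbitrary:

import numpy as np
from scipy import stats

d = stats.norm(loc=0., scale=2.)  # a "frozen" distribution; others work the same way
x = 1.3

print(d.cdf(x) + d.sf(x))              # 1.0: CDF and SF are complements
print(d.ppf(d.cdf(x)))                 # 1.3: PPF inverts the CDF
print(d.isf(d.sf(x)))                  # 1.3: ISF inverts the SF
print(np.exp(d.logpdf(x)) - d.pdf(x))  # 0.0: logpdf is just log(pdf)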

Elementary Gaussian Processes in Python

Gaussian processes are so hot right now, but I haven’t seen examples of the very basic computations you do when you’re “using Gaussian processes”. There are tons of packages that do these computations for you — scikit-learn, GPy, pygp — but I wanted to work through some examples using, and showing, the basic linear algebra involved. Below is what I came up with, as incarnated in an IPython notebook showing a few simple analyses.
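
To give a flavor of the linear algebra in question, here’s a bare-bones sketch of GP regression with a squared-exponential kernel. This is not the notebook’s code, just the textbook posterior-mean and posterior-covariance formulas; the kernel parameters, noise level, and data are arbitrary:

import numpy as np

def sqexp_kernel(x1, x2, amp=1., scale=1.):
    # Squared-exponential covariance between two sets of 1D points.
    return amp**2 * np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / scale)**2)

def gp_predict(xobs, yobs, xtest, noise=0.1):
    # Posterior mean and covariance of a zero-mean GP given noisy observations.
    K = sqexp_kernel(xobs, xobs) + noise**2 * np.eye(xobs.size)
    Ks = sqexp_kernel(xobs, xtest)
    Kss = sqexp_kernel(xtest, xtest)
    alpha = np.linalg.solve(K, yobs)
    mean = Ks.T.dot(alpha)
    cov = Kss - Ks.T.dot(np.linalg.solve(K, Ks))
    return mean, cov

xobs = np.array([-2., 0.5, 1.7])
yobs = np.sin(xobs)
xtest = np.linspace(-3, 3, 61)
mean, cov = gp_predict(xobs, yobs, xtest)
band = np.sqrt(np.diag(cov))  # 1-sigma width of the predictive distribution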

This post is also a pilot for embedding IPython notebooks on this blog. Overall it was pretty straightforward, though I had to insert a few small tweaks to get the layout to work right — definitely worth the effort, though! I haven’t really used an IPython notebook before but I gotta say it worked really well here. I generally prefer the console for getting work done, but it’s a really nice format for pedagogy.

Confidence intervals for Poisson processes with backgrounds

For some recent X-ray work, I’ve wanted to compute confidence intervals on the brightness of a source given a known background brightness. This is straightforward when the quantities in question are measured continuously, but for faint X-ray sources you’re in the Poisson regime, and things get a little trickier. If you’ve detected 3 counts in timespan τ, and you expect that 1.2 of them come from the background, what’s the 95% confidence interval on the number of source counts?

Of course, the formalism for this has been worked out for a while. Kraft, Burrows, and Nousek (1991) describe the fairly canonical (222 citations) approach. Their paper gives a lot of tables for representative values, but the formalism isn’t that complicated, so I thought I’d go ahead and implement it so that I can get values for arbitrary inputs.

Well, I wrote it, and I thought I’d share it in case anyone wants to do the same calculation. Here it is — in Python of course. There are a few subtleties but overall the calculation is indeed pretty straightforward. I’ve checked against the tables in KBN91 and everything seems hunky-dory. Usage is simple:

from kbn_conf import kbn_conf

n = 3 # number of observed counts
b = 1.2 # expected number of background counts
cl = 0.95 # confidence limit
source_rate_lo, source_rate_hi = kbn_conf (n, b, cl)

# here, source_rate_lo = 0, source_rate_hi = 6.61 --
# we have an upper limit on the source count rate.

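If you’re curious about what’s going on inside, here’s a rough numerical sketch of the underlying calculation. This is not the actual kbn_conf code, just the Bayesian posterior for the source rate and a brute-force grid search for the highest-posterior-density interval; the grid extent and resolution are arbitrary:

import numpy as np
from scipy.special import gammaln

def kbn_sketch(n, b, cl, smax=50., nsamp=200001):
    # Posterior density for the source rate s given n observed counts and an
    # expected background b (assumed > 0): prop. to exp(-(s+b)) * (s+b)**n / n!
    s = np.linspace(0, smax, nsamp)
    logpost = -(s + b) + n * np.log(s + b) - gammaln(n + 1)
    # Normalization: 1 / sum_{k=0}^{n} exp(-b) * b**k / k!
    k = np.arange(n + 1)
    lognorm = -np.log(np.sum(np.exp(-b + k * np.log(b) - gammaln(k + 1))))
    post = np.exp(logpost + lognorm)
    # Highest-posterior-density interval: accept grid cells in order of
    # decreasing density until the enclosed probability reaches cl.
    ds = s[1] - s[0]
    order = np.argsort(post)[::-1]
    cum = np.cumsum(post[order]) * ds
    accepted = order[:np.searchsorted(cum, cl) + 1]
    return s[accepted].min(), s[accepted].max()

It should land close to the limits quoted above, though the real code is more careful about the numerics.
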
Get in touch if you have any questions or suggestions!

CASA in Python without casapy

Like several large applications, CASA bundles its own Python interpreter. I can totally understand the decision, but sometimes it’s really annoying when you want to combine CASA’s Python modules with personal ones or those from another large package.

Fortunately, it’s not actually that hard to clone the CASA modules so that they can be accessed by your system’s Python interpreter — with the major caveat that the procedure might totally fail if the two different interpreters aren’t binary-compatible. I’ve had success in the two attempts I’ve made so far, though.

Really all you do is copy the key files. There’s a wrinkle, however, in that you need to set up the dynamically-linked libraries so that they can all find each other. This can all work automatically with the right RPATH/RUNPATH settings in the binary files, but empirically 99% of programmers are too clueless to use them correctly. Grrr. Fortunately, a tool called patchelf helps us fix things up.

Anyway, here’s how to equip an arbitrary Python interpreter with the key casac module — subject to binary compatibility of the Python module systems. I’m assuming Linux and 64-bit architecture; changes will be needed for other kinds of systems.

  1. Download and install patchelf. It’s painless.
  2. Download and unpack a CASA binary package. We’ll call the CASA directory {CASA}.
  3. Identify a directory that your Python interpreter will search for modules, that you can write to. The global directory is something like /usr/lib64/python2.7/site-packages/, but if you have a directory for personal python modules listed in your $PYTHONPATH environment variable, that’s better. We’ll call this directory {python}.
  4. Customize the following short script to your settings, and run it:
    #! /bin/sh
    
    casa={CASA} # customize this!
    python={python} # customize this!
    
    cd $casa/lib64
    
    # copy basic Python files
    cp -a python2.7/casac.py python2.7/__casac__ $python
    
    # copy dependent libraries, with moderate sophistication
    for f in lib*.so* ; do
      if [ -h $f ] ; then
        cp -a $f $python/__casac__ # copy symlinks as links
      else
        case $f in
          *_debug.so) ;; # skip -- actually text files
          libgomp.*)
            # somehow patchelf fries this particular file
            cp -a $f $python/__casac__ ;;
          *)
            cp -a $f $python/__casac__
            patchelf --set-rpath '$ORIGIN' $python/__casac__/$f ;;
        esac
      fi
    done
    
    # patch rpaths of Python module binary files
    cd $python/__casac__
    for f in _*.so ; do
      patchelf --set-rpath '$ORIGIN' $f
    done
    
  5. At this point you can blow away your unpacked CASA tree, though certain
    functionality will require files in its data/ directory.

All this does is copy the files (casac.py, __casac__/, and dependent shared libraries) and then run patchelf on the shared libraries as appropriate. For some reason patchelf fries the libgomp library, but that one doesn’t actually need patching anyway.

After doing this, you should be able to fire up your Python interpreter and execute

import casac

successfully, showing that you’ve got access to the CASA Python infrastructure. You can then use the standard CASA “tools” like this (assuming you’re using CASA version > 4.0; things were different before):

import casac
tb = casac.casac.table ()
ms = casac.casac.ms ()
ms.open ('vis.ms')
print ms.nrow ()
ms.close ()

I’ve written some modules that provide higher-level access to functionality relying only on the casac module: casautil.py for low-level setup (in particular, controlling logging without leaving turds all over your filesystem), and tasklib.py for a scripting-friendly library of basic CASA tasks, with a small shim called casatask to provide quick command-line access to them. With these, you can start processing data using CASA without suffering the huge, irritating overhead of the casapy environment.

Note: for Macs, I believe that instead of patchelf, the command to run is something like install_name_tool -add_rpath @loader_path libfoo.dylib — but I haven’t tested this.

Announcing: worklog-tools, for automating tedious CV activities

There are a lot of annoyances surrounding academic CV’s. Making a document that looks nice, for one. Maintaining different versions of the same information — short and full CV’s, PDF and HTML formats. Remembering how you’ve categorized your talks and publications so that you know where to file the latest one.

For me, one of the great frustrations has been that a CV is full of useful information, but that information is locked up in a format that’s impossible to do anything useful with — besides generate a CV. I’d like to collect citation statistics for my publications, and my CV contains the needed list of references, but I can’t pull them out for automatic processing. Likewise for things like previously-awarded NSF grants (hypothetically …) and lists of collaborators in the past N years. Some of these things are just interesting to know, and others are needed by agencies and employers.

Well, problem solved. Enter worklog-tools.

Based on the issues I’ve mentioned above, I feel like it’s pretty clear what you want to do: log CV-type activities — your academic output — in some kind of simple data format, and populate some kind of LaTeX template with information from the log. While we’re at it, there’s no need to restrict ourselves to LaTeX — we can also fill in an HTML template for slick, web-native versions of the same information.

I’ve actually gone and done this. There are a lot of ways you could implement things, but here’s what I do:

  • I log activities in a series of records in simple “INI format” files named 2011.txt, 2012.txt, etc. (a made-up example record appears after this list).
  • Special hooks for publication records tie in to ADS to fetch citation information and compute things like h-indices.
  • A simple command-line tool fills in templates using information from these files, in the form of either arbitrary data from the raw records, or more specialized derived data like citation statistics.
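
To give a flavor of the data format, here’s a made-up record in the spirit of what I log; the actual field names in my files may differ a bit:

[pub]
title = A Made-Up Paper About Interesting Stars
authors = Williams, P. K. G.; Coauthor, A. N.
pubdate = 2013/06
refereed = y
bibcode = 2013ApJ...999....1W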

Two components of this system are data — the log files and the templates. One component is software — the glue that derives things and fills in the templates. The worklog-tools are that software. They come with example data so you can see how they work in practice.

As is often the case, most of the work in this project involved making the system less complicated. I also spent a lot of time documenting the final design. Finally, I also worked to put together some LaTeX templates that I think are quite nice — you can judge the results for yourself.

Is any of this relevant to you? Yes! I sincerely think this system is straightforward enough that normal people would want to use it. A tiny bit of Python hacking is needed for certain kinds of changes, but the code is very simple. I think my templates are pretty nice — and I’m happy for people to use them. (If nothing else, you might learn some nice LaTeX tricks.) Finally, I think the value-add of being able to do things like collect citation statistics is pretty nice — and of course, you can build on this system to do whatever “career analytics” you want. For instance, I log all of my submitted proposals, so I can compute success rates, total time allocated, and so on.

The README on GitHub has many more details, including instructions about how to get started if you want to give it a try. I hope you enjoy!

By the way: INI format is highly underrated as a simple data format. It’s almost as underrated as XML is overrated.

By the way #2: Of course, nothing in this system makes it specific to CV’s — with different templates and log files, you can insert structured data into any kind of document.

By the way #3: Patches and pull requests are welcome! There are a zillion features that could be added.

By the way #4: A lot of this work was probably motivated by the fact that my name isn’t very ADS-able — a search for P Williams pulls in way too many hits, and though I can’t get a search for exactly PKG Williams to work, I have a fair number of publications without the middle initials.

Trends in ultracool dwarf magnetism: Papers I and II

Well, the pair of papers that I’ve been working on for much of this year have finally hit arxiv.org, showing up as 1310.6757 and 1310.6758. I’m very happy with how they turned out, and it’s great to finally get them out the door!

These papers are about magnetism in very small stars and brown dwarfs, which we refer to as “ultracool dwarfs” or UCDs. Observations show that UCDs produce strong magnetic fields that can lead to large flares. However, the internal structure of these objects is very different than that of the Sun (no radiative core), in a way that makes it challenging to develop a theory of how UCDs produce their magnetic fields, and of what configuration those fields assume.

So we turn to observations for guidance. Our papers present new observations of seven UCDs made with the Chandra space telescope, detecting X-rays, and the recently-upgraded Very Large Array, detecting radio waves. Magnetic short circuits (“reconnection events”) are understood to lead to both X-ray and radio emission, and observations in these bands have turned out to provide very useful diagnostics of magnetism in both distant stars and the Sun.

When people such as my boss started studying UCD magnetism, they soon discovered that the radio and X-ray emission of these small, cool objects has several surprising features when compared to Sun-like stars. We hope that by understanding these surprising observational features, we can develop a better theoretical understanding of what’s going on “under the hood.” This in turn will help us grapple with some challenging basic physics and also inform our understanding of what the magnetic fields of extrasolar planets might be like, which has large implications for their habitability (e.g.).

The first paper considers the ratio of radio to X-ray brightness. While this ratio is fairly steady across many stars, in some UCDs the radio emission is much too bright. The second paper considers X-ray brightness as a function of rotation rate. UCDs tend to rotate rapidly, and if they were Sun-like stars this would lead to them having fairly bright X-ray emission regardless of their precise rotation rate. But instead, they have depressed levels of X-ray emission, and the faster they rotate the fainter they seem to get.

Our papers make these effects clearer than ever, thanks to both the new data and to work we did to build up a database of relevant measurements from the literature. I’m really excited about the database since it’s not a one-off effort; it’s an evolving, flexible system inspired by the architecture of the Open Exoplanet Catalogue (technical paper here). It isn’t quite ready for prime time, but I believe the system to be quite powerful and I hope it can become a valuable, living resource for the community. More on it anon.

One of the things that the database helps us to see is that even if you look at two UCDs that are superficially similar, their properties that are influenced by magnetism (e.g., radio emission) may vary widely. This finding matches well with results from studies using an entirely unrelated technique called Zeeman-Doppler imaging (ZDI). The researchers using ZDI can measure certain aspects of the UCD magnetic fields directly, and they have concluded that these objects can generate magnetic fields in two modes that lead to very different field structures. These ideas are far from settled — ZDI is a complex, subtle technique — but we’ve found them intriguing and believe that the current observations match the paradigm well.

One of my favorite plots from the two papers is below. The two panels show measurements of two UCD properties: X-ray emission and magnetic field strength, with the latter being a representative value derived from ZDI. Each panel plots these numbers as a function of rotation (using a quantity called the Rossby number, abbreviated “Ro”). The shapes and colors further group the objects by mass (approximately; it’s hard to measure masses directly).

X-rays and magnetic field versus rotation. There’s scatter, but the general trends in the two parameters (derived from very different means) are surprisingly similar. From 1310.6758.

What we find striking is that even though the two panels show very different kinds of measurements, made with different techniques and looking at different sets of objects, they show similar trends: wide vertical scatter in the green (lowest-mass) objects; low scatter and high values in the purple (medium-mass) objects; and low scatter with a downward slope in the red (relatively high-mass) objects. This suggests to us that the different field structures hypothesized by the ZDI people result in tangible changes in standard observational quantities like X-ray emission.

In our papers we go further and sketch out a physical scenario that tries to explain the data holistically. The ZDI papers have argued that fast rotation is correlated with field structure; we argue that this can explain the decrease of X-rays with rotation, if the objects with low levels of X-rays have a field structure that produces only small “short circuits” that can’t heat gas to X-ray emitting temperatures. But if these short circuits manage to provide a constant supply of energized electrons, that could explain the overly bright radio emission. The other objects may produce fewer, larger flares that can cause X-ray heating but are too infrequent to sustain the radio-emitting electrons. (There are parallels of this idea in studies of the X-ray flaring properties of certain UCDs.)

Our papers only sketch out this model, but I think we provide a great jumping-off point for more detailed investigation. What I’d like to do for Paper III is do a better job of measuring rotation; right now, we use a method that has some degeneracies between actual rotational rate and the orientation of the object with regards to Earth. Some people have argued that orientation is in fact important, so using different rotation metrics could help test our model and the orientation ideas. And of course, it’s important to get more data; our “big” sample has only about 40 objects, and we need more to really solidly investigate the trends that we think we see.

One great part of this project is that I worked on it not only with my boss Edo Berger, but also with a fantastic summer REU student from Princeton, Ben Cook. Ben’s the lead author of Paper II and he did fantastic work on many aspects of the overall effort. It was a pleasure working with him and I suspect he’ll do quite well in the years to come.

Landscape pages, emulateapj, and arxiv

The past couple of times that I’ve put up a paper on arxiv.org, I’ve had trouble with pages rotated to landscape mode. Here’s a quick note on what the issue is.

I’ve been using the pdflscape package to get landscape-oriented pages. It seems to be the standard LaTeX solution for this kind of thing. There’s also an lscape package; they’re almost the same, but if you’re using pdflatex the former inserts some magic to get the relevant page of the output file automatically turned when viewed on your computer.

The problem is that lscape and pdflscape have some kind of problem with the revtex4-1 document class, and revtex4-1 is what drives emulateapj under the hood. When you use them, the content on every page in the document is rotated in a bad-news kind of way. It took me a while to figure this out because I had only revtex4 installed on my laptop; emulateapj can work with either, and the older version doesn’t have the issue. arxiv.org does have revtex4-1, so the problem would seem to only appear for Arxiv submission.

Recent versions of emulateapj have a [revtex4] option to force use of the older version of the package, which should make this go away. I don’t know if Arxiv’s version of emulateapj is new enough to support this.

Alternatively, revtex comes with its own environment for rotating pages, turnpage. You can use this instead. Here’s an example of rotating a wide single-page deluxetable at the end of a document:

...
\clearpage
\begin{turnpage}
\begin{deluxetable} ...
\end{deluxetable}
\end{turnpage}
\clearpage
\global\pdfpageattr\expandafter{\the\pdfpageattr/Rotate 90}
\end{document}

The \pdfpageattr line gets the magic PDF page rotation to happen. Obviously, this assumes that pdflatex is being used.

I do not know if this works with multipage tables. More annoyingly, you still seem to be forced to place the table at the end of the document, which is lame. At least, I don’t know how to get the deluxetable onto its own page without seriously messing with the flow of the surrounding document. (From what I read, the \pdfpageattr command lasts until you change it, so if you somehow got a rotated table into the middle of the document, you’d also need to figure out how to clear the PDF page rotation afterward.)

Flashy interactive graphics in a scientific talk, realized

Following up on my previous post, I spent an hour hacking together my slide template with a little project called d3po, a hack from the recent dotAstronomy 5 conference that’s put together some demos presenting astronomical data with the d3 toolkit.

Here’s the result. It’s just a quick simple hack showing that this is possible, which in a way isn’t too impressive since there’s absolutely no reason it shouldn’t have been. But it’s still fun to see everything work.

Source code is available in the documents themselves, since this is the web!

Slides for scientific talks in HTML

As a Linux user, I’ve long been making slides for my talks in LibreOffice Impress (formerly OpenOffice Impress), the Free Software clone of PowerPoint. I don’t envy anyone who’s trying to maintain compatibility with Microsoft Office products, but frankly Impress has been slow and frustrating and buggy, and my use of it has been grudging at best.

Recently, however, I encountered an unusually wonderful bug where graphics in EPS format showed up everywhere except on the screen for the actual presentation — I confidently started my talk and moved to my first science slide, only to find an empty black expanse, as were most of my subsequent slides. Everything was fine on my laptop. And I said to myself: effffff this.

Today I happen to be at Bucknell University, where I just gave a colloquium to an audience of mostly physics undergraduates. I hadn’t given a talk aimed at this kind of audience before, so I had to make a bunch of new slides anyway. So, early last week, I took the plunge and tried to see if I could prepare my slides using HTML and show them in Firefox.

Why HTML? Well, I really like the idea of having my slides being stored in text format, so I can version-control them, edit quickly, and so on. And crucially, web browsers are now sophisticated enough that you can make great-looking slides in them — I don’t think this was true five years ago. It also seems really cool to have the option of embedding YouTube videos, Flash animations, interactive JavaScript demos, and all that kind of stuff. I didn’t do much of that in this talk, but I did put in a bunch of movies, which I’d never dared to do with Impress.

The experiment was a resounding success. I found a framework, reveal.js, that made it easy to prepare the slides and supported a few features that are pretty vital for scientific talks:

  • Ability to make a PDF out of the talk, in case disaster strikes or you’re not allowed to use your own computer.
  • Ability to use PDF graphics in slides (straightforwardly done with Mozilla’s pdf.js).
  • A “presenter console” mode with offscreen speaker notes.
  • As a bonus, easy embedding of relatively nice-looking equations with MathJax.

It took some time to get the appearance to be how I wanted and to figure out how best to construct slide layouts, but all told it wasn’t too bad. I loved being able to edit and rearrange my slides efficiently, and I’m very happy with the aesthetic appearance of the final product.

I put the presentation source code on GitHub so you can see how I did it. I did devise a few new tricks (such as working out how to embed PDFs), but most of the work was just using the stock reveal.js toolkit and tweaking the styling. Here’s a live, online version of the talk so you can see how it looks.

I wrote up some more detailed notes on the repository README. To be honest, this approach probably isn’t right for most astronomers — most astronomers have Mac laptops with Keynote, which is a lot better than Impress. And I needed to draw on a lot of technical background in order for the slide construction to feel “easy”. But once I got the template set up, it was quick and fun to make slides, and now here’s a template that you can use too!

Big thanks to Hakim El Hattab for making and publishing reveal.js, as well as to the authors of the various CSS/JavaScript tools I used. It was kind of incredible how easy it was to achieve some fancy, beautiful effects.

Prosser & Grankin, CfA Preprint 4539: A Bibliographic Story

A little while ago I was collecting published data on the rotation of Sun-like stars. As one often finds, there are helpful papers out there that compile lots of measurements into a few big tables. If you’re going to be thorough (and of course you are!), a standard part of the job is chasing down the references to make sure that you honestly say that you know where all of the data came from. Pizzolato et al. 2003 is one of these compilation papers, and a very nice one at that. It has very good referencing, which was why I was surprised to see that some of the measurements were attributed to

Prosser, C. F., & Grankin, K. N. 1997, CfA preprint, 4539

Not a respectable paper in a respectable refereed journal, but some random preprint. Now, cutting-edge work can genuinely rely on results so new that they haven’t yet appeared in print, but that’s not quite the case here: Pizzolato et al. is from 2003, while the preprint citation is to 1997. Presumably the work hadn’t been “in preparation” for six years.

This kind of thing generally happens when authors are reading a paper, find some results that they want to use, and just blindly copy the reference down for use in their own work. Citing an old preprint, like in this case, is a bit of a red flag: it’s an academic no-no to be citing things that you haven’t actually read, and if you’d actually read the paper, you’d have updated the citation to refer to its published form.

In the spectrum of academic sins, I consider this one to be relatively minor, and I suspect it’s committed pretty often. It’d be nice if Pizzolato et al. had chased down the refereed paper that the preprint became, but sometimes people get lazy. There’s still enough information to dig up the published article, so it’s an inconvenience but not much worse. If this kind of blindly-copied reference gets propagated through several generations of papers (which I have seen before), you run the risk of a game-of-telephone situation where the meaning of the original work is warped; but in this case the reference is just providing measurements, which ought to propagate from one generation to the next pretty reliably.

Since I expected to use the relevant data, and I did feel like I should double-check that, say, the named preprint even exists, I set out to hunt down the final published form. In this day and age this is usually easy: everything modern is on ArXiv, where preprints almost always get cross-linked to their corresponding published paper when the latter appears. (I figure this must happen more-or-less automatically, since most authors don’t bother to fill in that information after they’ve made their posting.) Older preprints are more work, but generally not hard to track down thanks to ADS. If the citation has a title, you’re pretty much set; if not, it’s easy to search for something published by the lead author around the time of the preprint. No big deal.

I started doing my usual ADS searches: no “Prosser & Grankin, 1997”. No “Prosser & Grankin, 1998”. The “Prosser et al., 1998” papers clearly weren’t relevant. Hmm. A little bit more searching, and in fact there are no papers in the ADS database with both Prosser and Grankin as coauthors. Authorship lists can change between preprint-dom and publication, sure, but it’d be pretty surprising if “Prosser & Grankin” somehow became “Prosser & not-Grankin”. Some kind of dramatic falling-out?

I decided to take another tack. Googling had revealed that there wasn’t anything like an online database of the CfA preprint series, but the CfA library sits two floors under my office. If anyone knows how to look up items in the CfA preprint series, it’d better be them. And compared to some academic services (cough IT), I’ve virtually always had great “customer service” experiences with libraries — I imagine librarians are awfully strongly motivated (for a variety of reasons) to show how much more effective they can be than the Google search box.

Sure enough, just a day after I submitted a question in the CfA “Ask a Librarian” online form, I got a phenomenally helpful response from Maria McEachern. She had fetched the hardcopy preprint from the Harvard Depository, sent me a nice OCR’d scan of it, attempted to chase down the published form (also coming up empty), and even emailed the coauthor, Konstantin Grankin, to see if he could shed some light on the situation.

With a digital copy of the paper in hand, I felt confident. If the paper had been published somewhere, we’d definitely be able to track it down. If not, we had enough information to create a record in the ADS and upload the paper text, so that the text of the preprint would be easily accessible in the future — which, after all, is the whole point of providing precise references.

I had doubted that we’d ever hear anything from Grankin, but he actually replied in just a few days, filling in the last missing piece of the puzzle:

Dear Maria,

Unfortunately, Charles Franklin Prosser, the talented scientist, was lost in automobile accident in August 1998. [...] Some papers have not been published because of this tragedy. That paper about which you ask, has not been published also. [...]

Konstantin’s email included a link to Charles Prosser’s AAS obituary, written by his frequent collaborator John Stauffer. It makes for compelling reading in its way:

[...] Charles worked harder and put in longer hours than all but a few present-day professional astronomers. He could usually be found at the office seven days a week, among the first to arrive and the last to leave. Charles enjoyed harvesting astronomical data both from long observing runs on mountain tops and from rarely-read observatory publications. He was conservative in both his personal and his professional life; he very much preferred simply to state the results of his observations and to make as few extravagant interpretations of those observations as possible.

[...] Charles was survived by his parents, Charles Franklin Prosser, Sr. and Lucy Hogan Prosser, of Suwanee, Georgia, his sister Evelyn, and other relatives. His scientific papers will be offered by the family to NOAO. Charles was buried in Monterville, West Virginia on August 22, 1998. He will be remembered as a kind, dedicated, and highly moral colleague, whose greatest desire was to be allowed to work 12 hours a day in the field to which he devoted himself entirely.

Well, it’s hard to leave things half-finished after reading that. So I used the obscure ADS Abstract Submission form to submit a record and a copy of the paper text, resulting in the creation of ADS record 1997cfa..rept.....P. This achieves the most important thing to me: making it so that the preprint text is available online in a durable way. There’s also a subtle importance to the fact that the preprint is now integrated into the ADS citation system. Part of it is a certain imprimatur: this is a real publication, that you can legitimately cite. But there’s also something about how ADS bibcodes (the 1997cfa..rept.....P identifiers) are essentially our community’s vocabulary for talking about our literature. Without a bibcode and its backing record, it’s hard to use, share, discuss, or really do anything with a publication, even if you want to. (Side note: ADS is an incredibly important service in the field, and should get tons of money!)

I did some searching (mostly in Google Scholar, I must admit) and found 12 citing papers: pretty good for an unpublished preprint whose full text was virtually inaccessible until now. Those links have been added to ADS so that anyone reading one of the citing papers will easily be able to pull up the record and text of the preprint.

The ADS copy of the fulltext unfortunately loses the OCR in Maria’s PDF, making it so you can’t easily copy and paste the text and (more importantly) data tables. [Update: ADS has fixed this; I had assumed that if the OCR disappeared in the first place, their system didn’t allow for this.]  So I’ve submitted the tables to the CDS in the hopes that they’ll host a more easily-usable version of the data. CDS has a stated policy of only accepting data from papers published in refereed journals, but these circumstances are clearly exceptional; I’m waiting to hear a response about this case. If they accept the data, that should integrate the preprint into the Simbad system, further increasing its visibility.

I’m not sure if there’s a moral to this story, besides the fact that I have some OCD tendencies. Twelve citations is, frankly, not a huge number, and the very fact that there was a Mystery of Preprint 4539 signals its narrow reach: something more important would already have been integrated into ADS and the full-fledged literature, one way or another. But the research enterprise is built out of lots of tiny steps, not all coming in the form of new experimental results; I’d like to think that this is one of them.

Here’s the full text of CfA Preprint 4539, including the OCR information so that the text and tables can be easily copied. [Update: this is now fully redundant with the ADS version.] Here’s a preliminary version of the data tables that I submitted to the CDS — there are likely improvements to be made in the metadata to match the CDS formats, but the data should be fully usable and complete.

Fixing erroneous “Insufficient storage available” errors on Android

Short answer: the fix might be to delete /data/app-lib/APP-PATH/lib. You need root.

What better way to spend a sunny morning than fixing problems with my phone?

For a while, I’ve had problems on my Nexus 4 where certain apps would refuse to install, or I couldn’t update them once they were installed. The error message would be “Insufficient storage available,” but that was clearly wrong because I had plenty of storage space available and the apps were small.

Now, most technical problems are best addressed with thorough Googling, but this kind of problem is a toughie. Amateur-hour Android phone futzing is a fascinating corner of the internet in its way — grounds for a dissertation in cargo cult behavior. Between poor reading comprehension (“try moving your apps to the SD card!”), lack of actual knowledge of how the system works (“I dunno, try clearing all of your app data?”) and excessive leverage (“try upgrading your CyanogenMod hacked ROM to the latest version and running [root]CacheCleaner-v7.x”), there’s a lot of heat but very little light.

To be fair, one of the issues is that this kind of error message apparently has many root causes. My impression is that if anything goes wrong during an app install, you’ll get the “Insufficient storage available” error.

In the end, the root cause of the error seems to almost always be a leftover file that somehow interferes with the intended install/update. For instance, for many people the problem seems to be leftover .odex files in /data/app (e.g.). For me, the problem turned out to be strange dangling files in /data/app-lib. In both of the cases I had to deal with, there was a recursive symlink named /data/app-lib/APP-PATH/lib, and blowing away that file solved the problem. (Here APP-PATH is something like com.fsck.k9-1 for K9 Mail. Hypothetically.) I could imagine that in other cases you might need to blow away the whole /data/app-lib/APP-PATH directory (cf.)
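
Concretely, the fix was roughly the following, using the hypothetical K9 example from above; substitute the path of whatever app is giving you trouble:

adb shell    # from the Android SDK, with USB debugging enabled
su           # become root
rm /data/app-lib/com.fsck.k9-1/lib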

The lame thing is that you need a rooted phone to do this — the relevant files are system files. If leftover app-lib files are causing your install/update problems and you don’t have a rooted phone, I think you’re just sunk. Which makes this a pretty bad bug. Maybe an OS update will prevent this from happening at some point; if not, maybe there’s a way to convince the OS to delete the offending directories on your behalf. Hopefully a fix that works on stock phones will come along, because this problem seems to bite a lot of people.

Numbered reverse-chronological CV listings with BibTeX

Prompted by a yahapj-related question from Máté Ádámkovics, I spent a little bit of time this evening figuring out how to get a “fancy” publication listing from BibTeX, where the papers are listed like:

[3] Some guy and me, 2013

[2] Me and someone else, 2010

[1] My first paper, 10,000 BC

I found a fragile way to make it work:

  • Make a copy of your preferred BibTeX style file and modify it to use the code in this StackExchange answer to reverse-sort your bibliography by year and month. (If you only know the base name of your style, you can find it with the command kpsewhich basename.bst)
  • If necessary, also modify the style file to only output \bibitem{key} text rather than \bibitem[abbrev]{key} text, since our hack can’t handle the latter.
  • Add the scary preamble from this StackExchange answer to your TeX file to reverse the bibliography list counters.
  • Put the preamble after other \usepackage commands and use as few of them as possible, since they seem to break the magic pretty easily.
  • Stick a \nocite{*} command in your TeX file somewhere. This causes BibTeX to emit a record for every item in your .bib file. (I’m presuming that you have a mypubs.bib file containing all of your publications and only your publications.)

Not exactly easy. But if there’s call for it, I’ll polish up the changes and add them to the yahapj repository with an example cv.tex for reference.


Bayesian Blocks Analysis in Python

I wrote a Python implementation of the Bayesian Blocks algorithm. There’s already a good implementation out in public from Jake Vanderplas, but it had some issues that I’ll briefly mention below. I’m being a bad member of the community by writing a new version instead of improving the existing one, and I feel bad about that, but I needed to get something going quickly, and I might as well put it out there.

Bayesian Blocks algorithms are nice for binning X-ray lightcurves into sections of different count rates; or, you can think of them as providing dynamic binning for histogram analysis. J. Scargle has been the main person developing the theory, and has put together a very nice detailed paper explaining the approach in 2013 ApJ 764 167. The paper’s written in a Reproducible Research style, coming with code to reproduce all the figures — which is really awesome, but something to talk about another day.

The Scargle implementation is in Matlab, with an IDL implementation mentioned but accidentally not provided, as far as I can tell. Jake Vanderplas’ code in the AstroML package seems to be the main Python implementation. It’s very nice, but has a few issues with “time-tagged” events, which is Scargle’s term for the X-ray-style case of binning up rates from individual Poisson events with precise timing. In particular, the AstroML code doesn’t let you specify the start and stop times of the observation, which can have huge semantic consequences — e.g., if you observed two events at times 1.0 and 2.0, your interpretation is very different depending on whether you observed from times 0.0 to 3.0 or times -1000 to +1000. The Vanderplas code also doesn’t implement iteration on the “p0” parameter or bootstrap-based assessment of bin height uncertainties, as suggested by Scargle.

My implementation has these features. It also has a fix for a mistake in the Scargle paper: equation 21 has a right-hand side of “4 – [term]” when it should be “4 – log [term]”. The new module is here on GitHub, and a script for doing a blocks analysis on a Chandra events file is here. Docstrings, tests, etc, are lacking, because as mentioned before I’m a bad community member. But if you want to fool around with the code, there it is.
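
For the curious, here’s a compressed sketch of the core dynamic-programming recursion for time-tagged events, in the style of the Scargle paper and the AstroML code. This is just the skeleton, not my module: it omits the start/stop times, the p0 iteration, and the bootstrap analysis, but it does use the corrected form of equation 21 mentioned above:

import numpy as np

def blocks_sketch(t, p0=0.05):
    # t: event times (assumed distinct); returns the optimal block edges.
    t = np.sort(np.asarray(t, dtype=float))
    n = t.size
    # Cell edges: midpoints between consecutive events, plus the data span ends.
    edges = np.concatenate(([t[0]], 0.5 * (t[1:] + t[:-1]), [t[-1]]))
    remaining = t[-1] - edges  # time from each edge to the end of the data
    # Prior on the number of change points, with the "4 - log(...)" correction.
    ncp_prior = 4 - np.log(73.53 * p0 * n**-0.478)
    best = np.zeros(n)
    last = np.zeros(n, dtype=int)

    for r in range(n):
        width = remaining[:r + 1] - remaining[r + 1]  # widths of candidate final blocks
        count = np.arange(r + 1, 0, -1)               # event counts in those blocks
        fitness = count * (np.log(count) - np.log(width))
        total = fitness - ncp_prior
        total[1:] += best[:r]
        last[r] = np.argmax(total)
        best[r] = total[last[r]]

    # Trace the optimal change points backwards.
    cps = [n]
    while cps[-1] > 0:
        cps.append(last[cps[-1] - 1])
    return edges[np.array(cps[::-1])]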

Typing Greek Letters Easily on Linux

I’ve already written about easily entering special characters on Linux using the “Compose Key”. The only inconvenient thing about this setup was entering Greek letters — they’re not included in the default list of compositions. I’ve learned a few of the raw Unicode values (Ctrl-Shift-u 0 3 c 3 for σ) but that’s not exactly ideal.

Disclaimer: the following really doesn’t work on Fedora 19. I’ve now set things up with a Greek keyboard option, so that hitting Super-Space once will switch me to Greek letters, and hitting it again will bring me back to normal. No more thin nonbreaking space or blackboard bold for me, though. Annoying.

You can customize the composition list to include things like Greek letters, with some limitations. Here’s the recipe for Gnome 3 on Fedora:

  • Copy the default mapping file, /usr/share/X11/locale/en_US.UTF-8/Compose, to ~/.XCompose.
  • Edit the file to include your new mappings — the format should be obvious. Here are my new entries that add Greek letters (γαγ), math blackboard bold capitals, and the thin nonbreaking space. I prefixed the Greek alphabet with “g”, so that Compose-g-b gives β. A couple of sample lines appear after this list.
  • You must be using the xim input method for the file to take effect. On my current Gnome/Fedora setup, I just had to run the “Input Method Selector” program and choose “Use X Compose table” from the list of options.
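
For reference, the Greek entries look something like this (the exact keysym spellings may vary with your X setup):

<Multi_key> <g> <a> : "α" U03B1 # GREEK SMALL LETTER ALPHA
<Multi_key> <g> <b> : "β" U03B2 # GREEK SMALL LETTER BETA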

That’s more or less it. However, the settings don’t get applied consistently — there seems to be a conflict between the way that the Gnome Shell wants to do input and this customization. If you start a program from the terminal, custom settings seem to take effect, but if you launch it from the Shell directly, they might not. I haven’t yet found a way to get everything on the same page.

Hack: after login, run imsettings-reload ; gnome-shell --replace &  in a terminal.

Note: I initially put some example blackboard bold capitals in this post. They showed up OK in the editor, but the saved post was truncated right where I put the capitals. So there must be some bugs in WordPress’ Unicode handling.

Note 2: I initially had Compose-~-~ for a nonbreaking space, but it turns out Compose-<space>-<space> already does that.

Propagating Uncertainties: The Lazy and Absurd Way

I needed to write some code that does calculations and propagates uncertainties under a fairly generic set of conditions. A well-explored problem, surely? And indeed, in Python there’s the uncertainties package which is quite sophisticated and seems to be the gold standard for this kind of thing.

Being eminently reasonable, uncertainties represents uncertain variables with a mean and standard deviation, propagating errors analytically. It does this quite robustly, computing the needed derivatives magically, but analytic propagation still fundamentally operates by ignoring nonlinear terms, which means, in the words of the uncertainties documentation, that “it is therefore important that uncertainties be small.” As far as I can tell, uncertainties does analytic propagation as well as anything out there, but honestly, if your method can’t handle large uncertainties, it’s pretty useless for astronomy.

Well, if analytic error propagation doesn’t work, I guess we have to do it empirically.

So I wrote a little Python module. To represent 5±3 I don’t create a variable that stores mean=5 and stddev=3 — I create an array that stores 1024 samples drawn from a normal distribution. Yep. To do math on it, I just use numpy’s vectorized operations. When I report a result, I look at the 16th, 50th, and 84th percentile points of the resulting distribution.

Ridiculous? Yes. Inefficient? Oh yes. Effective? Also yes, in many cases.

For instance: the uncertainties package doesn’t support asymmetric error bars or upper limits. My understanding is that these could be implemented, but they badly break the assumptions of analytic error propagation — an asymmetric error bar by definition cannot be represented by a simple mean and standard deviation, and an upper limit measurement by definition has a large uncertainty compared to its best value. But I can do math with these values simply by drawing my 1024 samples from the right distribution — skew normal or uniform between zero and the limit. I can mix perfectly-known values, “standard” (i.e. normally-distributed) uncertain values, upper limits, and anything else, and everything Just Works. (It might be hard to define the “uncertainty” on a complex function of a mixture of all of these, but that’s because it’s genuinely poorly-defined — analytic propagation is just misleading you!)

Another example: uncertainties spends a lot of effort tracking correlations, so that if x = 5 ± 3, then x - x = 0 precisely, not 0±4.2. My approach gets this for free.
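
A minimal sketch of the whole idea; the names here are illustrative and not the module’s actual API:

import numpy as np

NSAMP = 1024

def uval(mean, std):
    # An "uncertain value" is nothing but an array of samples.
    return np.random.normal(mean, std, NSAMP)

def upper_limit(lim):
    # A measurement known only to lie between zero and some limit.
    return np.random.uniform(0, lim, NSAMP)

def report(samples):
    # Summarize a distribution by its 16th/50th/84th percentile points.
    lo, mid, hi = np.percentile(samples, [16, 50, 84])
    return mid, mid - lo, hi - mid

x = uval(5, 3)
y = upper_limit(4)
print(report(x * y))  # mix well-measured values and limits freely
print(report(x - x))  # correlations come along for free: exactly zero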

I’ve found that approaching uncertainties this way helps clarify your thinking too. You worry: is 1024 samples big enough? Well, did you actually measure 5±3 by taking 1024 samples? Probably not. As Boss Hogg points out, the uncertainties on your uncertainties are large. I’m pretty sure that only in extreme circumstances would the number of samples actually limit your ability to understand your uncertainties.

Likewise: what if you’re trying to compute log(x) for x = 9 ± 3? With 1024 samples, you’ll quite likely end up trying to take the logarithm of a negative number. Well, that’s telling you something. In many such cases, x is something like a luminosity, and while you might not be confident that it’s much larger than zero, I can guarantee you it’s not actually less than zero. The assumption that x is drawn from a normal distribution is failing. Now, living in the real world, you want to try to handle these corner cases, but if they happen persistently, you’re being told that the breakdown of the assumption is a significant enough effect that you need to figure out what to do about it.

Now obviously this approach has some severe drawbacks. But it was super easy to implement and has Just Worked remarkably well. Those are big deals.

My New Snoozing Technique is Unstoppable

One of the great things about an academic job is flexibility in the hours. But this blessing can be a bit of a curse: it can be hard to get the day started when you don’t have to show up at work at any particular time. (Cue tiny violins, #richpeopleproblems, etc.) Like many people in this position, I’ve had a longstanding love/hate relationship with the snooze button on my alarm clock. I’d usually hit snooze four or five times and spend about 45 minutes half-awake feeling guilty for not really wanting to get up and at it.

Around a month ago I decided to try and get out of the rut. Unsurprisingly, the internet is full of advice on how to get out of bed in the morning, and equally unsurprisingly, most of it is dubious at best. Even less surprisingly, a lot of the advice is from self-appointed self-improvement gurus (whom I often find to be kind of bizarrely fascinating — they’re so weird!). One of these people is named Steve Pavlina, and I have to admit that I gleaned something from a blog post of his on the topic. I didn’t take his advice (simulate waking up quickly during the day), but I did like his perspective: relying on sheer willpower is not the answer. Maybe it works for some people, but not for me, and it’s really not fun to start every day with a demoralizing lost battle against inertia.

I decided that I needed to find something fun and easy to do first thing every morning. Not necessarily something that’ll get me up and out of bed instantly — but something that I’ll look forward to doing and that’ll keep me awake as my brain warms up. I’m a bit of a nerd, so what fits the bill perfectly for me? Catching up on my favorite blogs.

Specifically, instead of checking them compulsively all day, I now read them all in one big burst in bed using my phone and an app called NewsRob. My alarm goes off, I hit snooze once, and the next time it rings I sit up and grab my phone. Catching up on everything takes about the same amount of time as my previous snoozing, but it’s actually fun and doesn’t make me feel guilty. As a side benefit, I’m no longer compulsively checking blogs all day!

People who aren’t me will presumably have different priorities. A friend of mine watches the Daily Show instead. (One advantage of that over the blogs is that it’s not always easy to read a lot of dense text while your head’s still fuzzy.) I’m sure you could come up with other options.

So far, the new system’s been working great. My mornings feel way better and what I’m doing is so easy that I don’t think backsliding is going to be a problem. Baby steps to a better life!

(apologies to David Rees for the title [←bad language])

Fun with SSH configuration: connection sharing

Because nothing says “fun” like SSH tricks!

More and more organizations that use SSH are restricting it through “login” (aka “bastion”) hosts: locked-down machines that are the only bridge between the wilds of the Internet and internal networks. These often ban SSH public-key authentication, making you type in your password for every login. This can get to be a hassle if you’re frequently logging into your internal network. (As a side note, the link above is to the best explanation I could find, but I imagine it’d still be confusing for a newbie. Someone needs to write clear pedagogical material on SSH with public keys!)

There’s a somewhat-new feature in OpenSSH that can make life easier, though. You can now configure it to automatically “share” preexisting connections: if you’ve got one connection going and try to start another one, it’ll reuse the authentication and set up the connection without needing a password. So if you keep your first connection going, you can open more of them for free. Clearly this won’t always be useful, but for me, it often is.

To set this up, put something like the following lines in ~/.ssh/config:

ControlMaster = auto
ControlPath = ~/.ssh/connshare.%h_%p_%r.sock
ControlPersist = no

Read the ssh_config manpage to learn about the meaning of these settings and what some possible alternatives are. Some versions of SSH don’t support the ControlPersist option, in which case you can just leave it out.

With this setup, you can log in once at the beginning of the day and not worry about typing your password until quittin’ time, if you don’t close the original session. Or, if you have a program that needs to operate over SSH but for some reason can’t deal with the password prompts, you can pre-authenticate the connection and skip them.

While I’m at it, a few other SSH tips:

  • If you log in to remote computers a lot, the single biggest favor you can do yourself is to learn how to use screen or tmux.
  • Despite the lack of good introductory materials, it’s also really worthwhile to learn how public-key authentication works, and to use SSH keys when appropriate.
  • … and a super useful thing about SSH keys is the ssh-agent program, which remembers your decrypted keys. This lets you skip passphrase entry without compromising security (at least by real-world standards).
  • You may know that ssh user@host command will run command on your destination. To chain SSH invocations, use ssh -t user@outerhost ssh -t innerhost. The -t option is needed for unimportant reasons related to password entry.
  • Finally, you can also use your ~/.ssh/config file to preset usernames, ports, X11 forwarding, etc. for specific hosts; see the manual page and the example below.
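
For instance, a per-host stanza might look something like this; the host and user names here are made up:

Host worky
  HostName workstation.internal.example.edu
  User mylogin
  ForwardX11 yes
  ProxyCommand ssh -W %h:%p login.example.edu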

Reference: Running Chandra cscview on Fedora Linux

You’re supposed to access the Chandra Source Catalog with a Java applet, but it appears there’s no version of Java compatible with both the applet and Firefox. (In the sense that many Java versions have big security bugs and Firefox blocks them.) Thus, although there are recommendations for using the applet on Fedora Linux, I believe they are now inoperative. Here’s a solution.

  1. Download and install the latest Java 1.6 JDK. The JRE won’t cut it since we need the appletviewer program. The instructions above call for 32-bit, but I think that’s only for Firefox compat; 64-bit should be OK.
  2. Download a local copy of the HTML for the viewer applet. The official version specifies width="100%" which appletviewer can’t handle; I can’t think of a clever way to get around this.
  3. Edit the local copy to say something like width="640" height="480".
  4. But now we’ve lost the base URL, so create a subdirectory called client and download the jar files mentioned in the HTML, e.g., jsamp 1.0.0.0.
  5. Edit jdk*/jre/lib/security/java.policy and copy the java.security.AllPermission to apply to all applets. Yeah, we’re classy.
  6. Finally, run jdk*/bin/appletviewer file:///path/to/your/cscview.html .
  7. (Optional) Discover the cscview doesn’t do what you want. ☹

How to Increase App Space with a Partitioned SD Card on an HTC Nexus One Running Android 2.3.6 on Linux

This recipe written December 2012. Keep in mind that techniques vary a lot over time. And, yeah, sorry for the SEOriffic title.

I like a lot about the HTC Nexus One phone, but one of its annoying problems is that it has a tiny amount of storage space for apps. Maps, Twitter, Firefox, Go SMS Pro, and K-9 Mail are pretty much must-haves for me, and they alone almost fill up the phone’s “internal storage” for apps. It’s basically one-in-one-out for installing apps, and the constant “space is running low” messages are not only annoying but also indicate that the phone won’t perform certain important functions.

If you Google around, you’ll see various “app2SD” or “apps2SD” apps or scripts that promise to let you move apps to the SD card, which isn’t huge (4 GB for me) but is bigger than the 180 MB of the internal storage. There’s built-in support for this with some apps on Android 2.3.6, and some of the app2sd tools seem to be remnants from an earlier time. (Others are probably just trying to cash in on ignorant people.) You also see people talking about reformatting your SD card to add an ext-format partition to allow app storage there, which makes sense; the VFAT filesystem used by the SD cards by default doesn’t support some of the features needed for system files.

It took me a surprisingly long amount of time to find what seems to be the current best system: Super APP2SD by XDA Developers user TweakerL with assistance from several others. The basic approach is straightforward: in the early stages of system startup, a partition on the SD card is bind-mounted to a few magic app directories, redirecting their storage from the internal NAND memory to the SD. And that’s all you need to do!

As usual, the code written by the XDA people is a little dubious and scary, and the instructions are not great and also targeted at Windows users. So I’ve adapted their technique and documented what I did here. That being said, big thanks to TweakerL and friends for figuring this out and publishing the technique. The caveats:

  • You need a rooted phone.
  • Of course you run the risk of bricking your phone, killing all your apps, etc.
  • Apps become substantially slower to load and less responsive. The difference is noticeable and annoying. I still prefer having the freedom to actually install new apps, which I basically didn’t before, but this downside is significant.
  • The transfer is all-or-nothing, so you can’t keep a few apps on internal storage for faster loading.
  • The storage space reports on the phone become inaccurate.
  • None of your apps are available if your SD card is unavailable.
  • I’ve only been using this system for a little while, so there may be other issues that I haven’t yet appreciated.

Here’s how it’s done:

  1. Do some preparation.
    1. Root your phone, perhaps using my instructions. This has its own large set of caveats and hassles. For me, at least, the lack of app storage is an annoying enough issue that I’m happy that I went to the trouble of the rooting.
    2. You’ll need the Android SDK and the adb program. You just need the SDK, not the ADT bundle. My post on rooting has a bit more info on the installation.
    3. You also need to install Busybox on the phone to get a more powerful mount program for later. You install the Busybox app, then run it to actually install the needed files. It wanted to place all sorts of crazy programs (httpd??) that I disabled, but they’re probably fine. I think the only truly necessary tools may be mount and mknod.
  2. Back up everything, especially your apps (duh), their data, and your SD card.
  3. Add an ext3 partition to your phone’s SD card.
    1. You backed up your SD card, right? Because you’re about to wipe it.
    2. Somewhat surprisingly, you can do this while your phone is running. Connect your phone to your computer via USB and turn on USB storage. Don’t mount the VFAT partition.
    3. I added the partitions using the very nice GNOME Disks program: I essentially shrank the first partition from 4 to 3 GB (by deleting and recreating it), then created a new 1 GB ext3 partition. The volume labels aren’t important, but the partitions should be primary and bootable. Android 2.3.6 can’t handle ext4, so use ext3. Make sure that the type of the VFAT partition is W95 FAT32; I had to change this manually after creating the partition. Once everything is set up, you should be able to mount both the VFAT and ext3 partitions on your computer and verify them. (If you prefer the command line to a GUI, a sketch of an equivalent fdisk/mkfs session appears after this list.)
    4. Restore your backup of the VFAT files.
    5. If you disconnect the USB, your phone should reload the SD card without noticing any problems. If it does complain, let it reformat the card (which it will offer to do), then reconnect to your computer and figure out what you did wrong. For me, the culprit was the partition type.
  4. Get the ext3 partition mounted on your phone.
    1. This part of the process involved some trial-and-error so I’m not sure what’s strictly necessary. I’m pretty sure that you need to reboot the phone to get the OS to notice the new SD partition.
    2. Turn on USB Debugging on the phone. Connect the USB and open up a shell using the Android SDK’s platform-tools/adb shell command. Use su to become root.
    3. Many of the built-in shell utilities are surprisingly braindead, even given that we’re running Unix on a tiny piece of plastic and metal. When in doubt, prefix commands with busybox to use the Busybox versions, which are generally smarter.
    4. Run busybox mount -o rw,remount / and busybox mount -o rw,remount /system to be able to monkey with the filesystem.
    5. You may also need to manually create device nodes to be able to mount the ext3 partition. The VFAT partition of the SD card is mounted from something like /dev/block/vold/179:1, but the Busybox mount program seems to cut off the filename at the colon. There seems to be an equivalent device node called /dev/block/mmcblk0p1, so I created /dev/block/mmcblk0p2 using busybox mknod and emulating the parameters of the VFAT partition.
    6. After the right magic has happened, I created a mount point at /mnt/sd-ext and was able to mount the partition with busybox mount -o nosuid,nodev,noatime /dev/block/mmcblk0p2 /mnt/sd-ext. (A condensed sketch of this shell session, including the copying in step 5, appears after this list.)
    7. Also create a mount point at /mnt/nand-data for later.
  5. Mirror the apps and data from the internal storage to the SD.
    1. The relevant internal directories are /data/app and /data/data. The XDA Developers approach also mirrors /data/media/Android, but I can’t find anything resembling this directory on my phone, so I suspect that it’s only present in newer Android versions.
    2. Create /mnt/sd-ext/data/app and /mnt/sd-ext/data/data and duplicate the contents of /data/app and /data/data using your favorite technique. You may need to use busybox cp and not just cp to get more useful options. Make sure to preserve permissions, ownership, etc.
    3. I touched /data/data/this_is_old and /mnt/sd-ext/data/data/this_is_new as sanity checks for later.
  6. Install the code for futzing with the mounts at boot time.
    1. The trick here is to intercept an invocation of /system/bin/debuggerd, which is apparently called early in the boot process.
    2. Move /system/bin/debuggerd to /system/bin/debuggerd.bin.
    3. Create /system/bin/debuggerd.shim with the following contents:
      #!/system/bin/sh
      /system/xbin/pkgw.earlymounts
      exec /system/bin/debuggerd.bin

      I did this by adb pushing the file onto the VFAT SD card and then moving it into place from the shell; you can’t push directly into /system/bin because of permissions. Give the file permissions and ownership mirroring those of debuggerd.bin.

    4. cd /system/bin ; ln -s debuggerd.shim debuggerd
    5. Create /system/xbin/pkgw.earlymounts with the following contents:
      #!/system/bin/sh

      export PATH=/sbin:/system/sbin:/system/xbin:/system/bin

      # Device and mount-point paths -- check these against the output of
      # mount on your own phone.
      nanddata_dev=/dev/block/mtdblock5
      nanddata_mount=/mnt/nand-data
      sdext_dev=/dev/block/mmcblk0p2
      sdext_mount=/mnt/sd-ext

      # Make the root filesystem writable so the mkdir below can succeed.
      busybox mount -o remount,rw /
      busybox mkdir -p $sdext_mount $nanddata_mount

      # Mount the internal app storage and the SD card's ext3 partition.
      busybox mount $nanddata_dev $nanddata_mount
      busybox mount -o nosuid,nodev,noatime $sdext_dev $sdext_mount

      # Redirect the app directories only once the sentinel file exists;
      # we create it in a later step, after mirroring the data.
      if test -f $sdext_mount/data/remount_is_safe ; then
        busybox mount -o bind $sdext_mount/data/data /data/data
        busybox mount -o bind $sdext_mount/data/app /data/app
      fi

      # Put the root filesystem back the way we found it.
      busybox mount -o remount,ro /

      It should be pretty apparent how it all works. Do check the device paths and mount paths against the output of mount to be sure that everything is pointing in the right place. Set ownership and permissions on pkgw.earlymounts as with debuggerd.shim. (The mkdir -p above might not be needed; I don’t know what persists across boots. I also discovered that the Android shell doesn’t have [ as an alias for test by default.)

  7. Reboot the phone to verify that the shim works. If you log in with adb shell, the /data directory should be unchanged, but /mnt/sd-ext and /mnt/nand-data should now be mounted appropriately right from bootup.
  8. Activate the bind mounts.
    1. Enable them by touching /mnt/sd-ext/data/remount_is_safe.
    2. Reboot the phone.
    3. Along with the new mounts in /mnt, you should see that /data/data and /data/app are mounted from /dev/block/mmcblk0p2 now. Their contents should be identical and your phone should work. But /data/data should contain a file called this_is_new, not this_is_old. Your apps are now all living on your SD card!
    4. Check out how much storage your phone thinks it has free in the internal app storage space. Try installing a new, large app: the amount of free space should remain unchanged, assuming nothing else is eating space behind your back. (Firefox’s hungry cache ate up some of mine and briefly made me think something had gone wrong.)
  9. Clean up.
    1. Delete files in /mnt/nand-data/data and /mnt/nand-data/app to free up space on the internal storage. Make sure you’re working in the right directory! Start by deleting the old copy of something noncritical and confirming that the corresponding app still runs on the phone (it now lives on the SD card). Reboot again if that’ll make you feel better. Don’t delete critical-sounding apps and data, in case your SD card goes missing. After doing this, my internal storage has 74 MB free, which should be enough for all sorts of temporary stuff.
    2. In the app manager, move apps back “from” the SD “to” internal storage — even though internal storage is now also on the SD. We’re just relocating things from the VFAT partition to the ext3 partition. This just makes it so you can still use those apps when you’ve got the VFAT mounted over USB.
    3. Go to the app store, go to My Apps, slide over to the All listing, and install all those old apps that you removed because of space restrictions.
    4. As a side benefit, you can now mostly back up your apps and their data directly over USB, rather than by having to use a dedicated backup program.
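
If you’d rather do step 3’s repartitioning from a Linux shell instead of GNOME Disks, something like the following (untested) sketch should be equivalent. Treat it as a sketch, not gospel: /dev/sdX is a placeholder for whatever device name your SD card gets (check lsblk or dmesg before typing anything destructive), and the sizes match my 4 GB card.

# Repartition interactively. In fdisk: delete the old partition(s); create
# a primary partition 1 of about 3 GB and set its type to 0b ("W95 FAT32");
# create a primary partition 2 from the remaining space (type 83, "Linux");
# mark partition 1 bootable; then write the table.
sudo fdisk /dev/sdX

# Make the filesystems. Remember: ext3, not ext4.
sudo mkfs.vfat -F 32 /dev/sdX1
sudo mkfs.ext3 /dev/sdX2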
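
Condensed into one listing, the adb shell session for steps 4 and 5 looks roughly like this sketch. The mknod numbers (179, 2) are what the /dev/block/vold/179:1 naming suggests for the second partition; check ls -l /dev/block/mmcblk0p1 on your phone and adapt as needed.

# On the phone, after `adb shell` and `su`:

# Make / and /system writable for the following steps.
busybox mount -o rw,remount /
busybox mount -o rw,remount /system

# Create a device node for the SD card's second partition, mirroring the
# parameters of the existing first-partition node.
busybox mknod /dev/block/mmcblk0p2 b 179 2

# Create the mount points and mount the ext3 partition.
busybox mkdir -p /mnt/sd-ext /mnt/nand-data
busybox mount -o nosuid,nodev,noatime /dev/block/mmcblk0p2 /mnt/sd-ext

# Mirror the app directories onto the SD card, preserving permissions and
# ownership (-a is Busybox cp's archive mode).
busybox mkdir -p /mnt/sd-ext/data/app /mnt/sd-ext/data/data
busybox cp -a /data/app/. /mnt/sd-ext/data/app/
busybox cp -a /data/data/. /mnt/sd-ext/data/data/

# Sanity-check markers for later.
touch /data/data/this_is_old /mnt/sd-ext/data/data/this_is_new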

Big thanks to TweakerL and the XDA Developers folks for blazing the trail!

Update: Having used my phone this way for the past few days, I’m still happy I did this, but if I were starting over I might consider moving only the app data, and not the apps themselves, to the SD card. There would still be a lot of space left in the system storage for a reasonable number of apps, and it might help with the responsiveness issues I’m seeing.

Silence is Golden

Scientists love to write programs that are too chatty.

Chatty programs are eager to tell you what they’re doing and how they’re doing it. On the command line, this means you get a lot of output:

$ ./run-simulation.py input.dat output.dat
Science simulation by Peter Williams
Opening the input "input.dat" ...
   ... 300 items
Running simulation!
Step 1
DEBUG: n=12
Step 2
Step 3
Opening "output.dat" for writing ...
All done!
$

Contrast this with classic Unix tools, where silence implies success:

$ cp source.txt destination.txt
$

There are a lot of reasons why people write chatty programs: some good, some at least understandable. But I want to suggest that chattiness is something to be actively fought.

One concrete issue is that it’s hard to notice when chatty programs have problems. Showstopping errors are generally easy to pick out because they’ve, well, stopped the show, but I can’t tell you how many times I’ve missed some important warning message because it got scrolled off the screen by a bunch of inane diagnostics. And beginning programmers — which is what most scientists are — generally have trouble writing error-handling code and often write programs that charge ahead past conditions that should be showstoppers.

Less concretely but just as importantly, chatter is distracting. When I’m using a computer, I’m trying to get stuff done, and I get more done when I’m not wasting time thinking about low-level details: I’m in a line of work where my attention is one of my most valuable resources. A chatty program is like a needy employee who can’t make progress without coming back to you over and over for handholding and reassurance. Just do what I asked and stop bothering me unless you run into a problem you can’t solve. When I need to move files around in the terminal, I can bang out commands and just know that everything is OK because I’m not seeing any messages; with a chatty simulation, I have to stop and read its output to figure out whether anything went wrong.

My advice is to look at every line your program prints and ask yourself: do I really care? Vital outputs should be respected, either by not being surrounded by chatter or by being recorded to disk instead of left to scroll off the terminal screen. Diagnostics should be separate: either optional and off by default, or sent through some secondary channel so that the “main” output stays clean. Always keep in mind that every unhelpful message draws your precious attention away from the important ones.
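
To make that concrete, here’s a hypothetical sketch of how the simulation driver above could be restructured: diagnostics go through Python’s logging module to stderr, they stay off unless you pass --verbose, and warnings and errors always get through.

#!/usr/bin/env python
# Hypothetical sketch of a quiet-by-default run-simulation.py.
import argparse, logging, sys

def main ():
    parser = argparse.ArgumentParser ()
    parser.add_argument ('input')
    parser.add_argument ('output')
    parser.add_argument ('-v', '--verbose', action='store_true',
                         help='print progress diagnostics to stderr')
    args = parser.parse_args ()

    # Chatter is opt-in; warnings and errors always get through.
    logging.basicConfig (stream=sys.stderr,
                         level=logging.DEBUG if args.verbose else logging.WARNING)

    logging.debug ('opening input "%s"', args.input)
    # ... the actual simulation goes here, writing results to args.output ...
    logging.debug ('all done')

if __name__ == '__main__':
    main ()

Run it normally and it prints nothing unless something actually goes wrong; pass -v when you want the play-by-play.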

For the beginners out there, I’d argue that reducing chatter also helps build confidence. One tends to have this irrational fear that the program usually works but maybe this time it’s derailed in some crazy way and is producing garbage. Seeing the same output over and over again is reassuring. But take those training wheels off! If you’re seeing the same output over and over, by definition you’re not learning anything new. To be confident in quiet programs, you do have to evaluate which parts of your program are the most likely to go wrong, and you do need to get good at handling error conditions. These are vital skills for competent programming, and the faster you develop them, the better.