Crying out loud for better research artifacts

In my first semester during my Computer Science Master’s program at TU Dortmund University1, I participated in a seminar titled “Reproducibility of Research Artifacts”. The gist of it: Our professor studies the use of research artifacts in the Computer Science community. To gather data on their usefulness, he devised this seminar, where a bunch of students each read through six old conference papers and try to use the accompanying research artifacts to (among other checkboxes) roughly verify the papers’ claims. This seminar is offered multiple times by multiple professors around the world, slowly building up a pile of new metadata to be analyzed for some new paper about reproducibility.

I am now done with reading the introductory papers and checking my six research artifacts, leaving me only with the dreaded task of contributing a few pages of formal research talk. Paragraphs to be graded, stapled to the other participants’ texts and archived somewhere, only ever read by two people. Rephrasing what all the other students have already contributed, be it this year or in the years before and after. The perfect use case for “Hey ChatGPT, please rephrase this text from before but replace their six papers with mine”.

At times, my brain sucks. Especially when it is time to write in academic lingo. This post is my attempt to extract the gist of my »findings« from its mental claws.

Badges of honor

In the old times of 2010, the term »replication crisis« popped up in academia. Psychology people stumbled upon the fact that they could not reliably reproduce the original findings of older papers, effectively downgrading them from scientific research to hot garbage. Innocent until proven guilty, and scientifically worthless until correctly reproduced by a later study.

Watching from the sidelines of the spectacle, the computer science community took notes and decided “Yeah, maybe we should include the source code of our experiments with our papers instead of simply claiming things”. And because nerds do love to make up new standards, the ACM badges were born. I present to you the pinnacle of serious science:

Hand in your paper to a conference, ask for your artifacts to be reviewed as well and maybe be allowed to stick a few of these precious PNGs to your first page: First page of paper titled “Horn-ICE learning for synthesizing invariants and contracts”. In the top right corner, two circular badges. A green one: “artifacts available” and a red one “artifacts evaluated functional”

Yeah, go bold colors!

The first badge is more or less free: ZIP your source code and let CERN store it forever (well, for at least 10 years). Great, now your artifact is »available«.

Next up: »Functional«: Works-On-My-Machine is no longer enough, now other people need to be able to run your hacky data processing pipeline. Ugh. But okay, you can create a virtual machine image with all your code and dependencies, create a bash script and write a short README.pdf describing how to log in, open a terminal and execute said bash script. Hey CERN, can you please also store this 5 GB Ubuntu 8 VM image for us? Now some poor soul in rural Germany can download that image for a few days to see that your bash script executes without errors, printing out some ASCII symbols along the way, hopefully none of them in red. Hey, it works.

But functional is easy; how about »Reusable«? No problem at all: Just make sure that every possible developer on the planet can use your research-quality code for their next project without friction. Take a deep breath, spend a night or two documenting your code, removing unused functions and assumptions about your specific VM, and hope for the best. Depending on the mood of the reviewer, your documentation is enough. Luckily they also use the macOS version you used, so they can tweak a few lines and the graphs are now pink instead of blue. Take that third, red badge and be proud of yourself. Good boy!

Now there are also badges relevant to the replication crisis: »Results Reproduced« and »Results Replicated«. Reproducing and replicating can mean either “other people re-did your research with your tools” or “other people re-did your research on their own, without your tools”. To avoid any clarity, each term can mean either of these practices, and it is up to the current conference committee to decide how they want to define them. At least the ACM has now settled on reproducing = same result, original tools and replicating = same result, different tools in v1.1 of its badges.

These two blue badges are rarely found on papers. Who waits for an entire replication study, before first publishing a paper?

Anyways, here is my professor explaining research about artifact evaluation and the expectations one might have for it in video form:

Thumbnail, beige title slide: Community Expectations for Research Artifacts and Evaluation Processes

Investigation procedure

To achieve some degree of comparability, students investigating the reproducibility of research artifacts were given a lab report template, written in Markdown, to log efforts and note observations. It includes metadata (DOI, time it took to check everything, badges the paper previously earned) and a section each for »Availability«, »Relatedness«, »Functionality« and »Reusability«. Note the absence of a »Reproducibility« section. For each of these properties, we were asked to make a boolean decision: Does the paper’s artifact earn the right to hold that title? How was that checked? Which interesting findings were observed? The whole template is rather free-form, leaving it as an exercise to our professor to normalize all lab reports at a later stage.

Click to view the full lab notes template
# Lab Notes on Research Artifact

* Date of examination:
* Start time:
* End time:

## Metadata

* Paper DOI (necessary):
* Artifact DOI (optional):
* Badges achieved in regular artifact evaluation:
* Time (in minutes) needed to read the paper:

## Availability

* [ ] Artifact is archived in a public archive with a long-term retention policy
* [ ] Artifact is available on a different website
* Time (in minutes) needed for the check:

### Checks performed

<!-- Note all checks you performed here to answer the above questions -->
1. Artifact is in a well-known archive (i.e. FigShare, Zenodo, or
2. ...

### Observations

<!-- Note all observations (negative, neutral, or positive) made -->

## Relatedness

* [ ] Artifact is related to the paper
* [ ] Artifact is NOT related to the paper
* Time (in minutes) needed for the check:

### Checks performed

<!-- Note all checks you performed here to answer the above questions -->

### Observations

<!-- Note all observations (negative, neutral, or positive) made -->

## Functional

* [ ] Artifact is considered functional
* [ ] Artifact is NOT considered functional
* Time (in minutes) needed for the check:

### Checks performed

<!-- Note all checks you performed here to answer the above questions -->

### Observations

<!-- Note all observations (negative, neutral, or positive) made -->

## Reusable

* [ ] Artifact is considered reusable
* [ ] Artifact is NOT considered reusable
* Time (in minutes) needed for the check:

### Checks performed

<!-- Note all checks you performed here to answer the above questions -->

### Observations

<!-- Note all observations (negative, neutral, or positive) made -->

Each artifact investigation was supposed to last at most six hours. If we could not get the artifact to function in that time, we could label it as “not functional”. Inaccessible artifacts (either due to unavailability or due to impossible hardware requirements) were swapped for new artifacts.

The suspects

As I’ve hinted at earlier, I had to review six papers and their accompanying artifacts. So let me quickly present the suspects, assigned to me at random:

A) Horn-ICE learning for synthesizing invariants and contracts

The authors built a new verifier for programs, to crush other state-of-the-art verifiers at some program verification competition for logic nerds. Its main contribution is a dual-role system: a learner component proposes hypotheses, for which the teacher component then tries to generate counterexamples in the form of Horn clauses. Apparently it is fast, and they are righteously dunking on other researchers’ verifiers:

Excerpt from Horn-ICE paper: comparison bar chart where Horn has bigger numbers than another approach

B) Codebase-adaptive detection of security-relevant methods or »SWAN«

You have heard about taint analysis, haven’t you? No? Okay: We look at paths in code (in this paper’s case, Java code for Android apps) which take untrusted input and mark (»taint«) the incoming data. The data keeps its taint while traveling through the code base, and the taint is only removed when the data passes through sanitizer functions. If tainted data reaches an output function (where malicious inputs could actually attack the system or user), we have found a potential security vulnerability.
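To make that less abstract, here is a toy model of the idea. This is not SWAN (which classifies functions with machine learning); all names here are made up, and I am just sketching the source/sanitizer/sink vocabulary:

```python
# Toy taint tracking, not SWAN itself: a value picked up at a
# "source" carries a taint flag until a "sanitizer" clears it;
# a "sink" must never see tainted data.

class Tainted(str):
    """A string value flagged as untrusted."""

def source() -> str:
    # hypothetical source, e.g. text from an HTTP parameter
    return Tainted("'; DROP TABLE users; --")

def sanitize(value: str) -> str:
    # escaping counts as sanitizing; str.replace on a Tainted
    # instance returns a plain (untainted) str
    return value.replace("'", "''")

def sink(query: str) -> None:
    # hypothetical sink, e.g. SQL execution
    if isinstance(query, Tainted):
        raise ValueError("tainted data reached a sink")

data = source()
sink(sanitize(data))  # fine: taint removed before the sink
try:
    sink(data)        # tainted data reaches the sink unchanged
except ValueError as err:
    print(err)
```

Real taint analyses do this statically over all paths in the program, which is exactly why knowing which functions are sources, sanitizers and sinks matters so much.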

So these guys built SWAN and SWAN_Assist, an accompanying IntelliJ plugin. It does some machine learning magic on Android Java code to detect and annotate source, sanitizer and sink functions. Maybe it can help Android folks write better stuff. I have some serious respect for actually building something usable (an IntelliJ plugin) from their research, instead of just a bare-bones proof of concept:

IntelliJ IDE: On the right is a sidebar with a list of function names, each with a little colored dot to the left, indicating the class: source, sanitizer or sink

C) Prediction of atomic web services reliability based on k-means clustering

This one does not count, as the artifact could not be retrieved. It was replaced with a different paper.

D) Verifying concurrent search structure templates

Computer scientists like to make a great deal out of formally verifying that their algorithms and data structures really do what they are supposed to do. While this is a great way to scare first-semester students, doing it for threaded, concurrent data structures is a great way to keep PhD students busy. To my dismay, these authors from New York used the proof assistant Coq, which I don’t really understand. But: They claim to have managed to construct verified implementations of B-trees, hash tables, and linked lists, with which you could build safer-than-average file systems and databases. Great!

E) Fuzzi: a three-level logic for differential privacy

A catchy name, ain’t it? The authors chip away at »differential privacy«. DP says: We don’t want your personal data, we want statistics over all our customers. So let’s add some noise (read: random changes) to your data before sending it to our analytics server. They applied that to machine learning algorithms and used that as an excuse to write Fuzzi, a checker that calculates sensitivity bounds for the privacy attributes of the generated models. Somehow this is both better and simpler than previous methods.
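For a feeling of the core trick, here is a minimal sketch of the textbook Laplace mechanism for a counting query. This is not Fuzzi (which reasons about sensitivity bounds statically rather than just adding noise); the function names and data are mine:

```python
import random

def laplace_noise(scale: float) -> float:
    # a Laplace(0, scale) sample is the difference of two
    # independent exponential samples with mean `scale`
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon: float = 1.0) -> float:
    # counting queries have sensitivity 1: adding or removing one
    # record changes the true count by at most 1, so noise with
    # scale 1/epsilon hides any individual's contribution
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

ages = [23, 35, 42, 29, 61, 38]
print(private_count(ages, lambda a: a >= 30, epsilon=0.5))
```

The analytics server gets a roughly correct count, but nobody can tell from the output whether your record was in the data set or not. The hard part, and the one Fuzzi automates, is proving how much noise is actually enough.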

F) Developer Reading Behavior While Summarizing Java Methods: Size and Context Matters

Hey ethics board, can we strap some poor CS undergrads to an iris-scanner for science, please?

The most straightforward paper title: A study in which 18 participants of different skill levels were asked to summarize Java methods. An eye tracker took note of which classes of code elements they looked at the most. They come to the conclusion that (contrary to an older study) developers tend to focus on control flow terms in the body and less on function signatures. The bigger the code, the more they jump back and forth. Maybe these findings will be useful for something practical down the line.

G) Lazy Product Discovery in Huge Configuration Spaces

The authors of this particular paper took on the challenge of speeding up Gentoo package installations. More scientifically, they used it as a case study for product configuration. While installing a set of packages, the package manager (in Gentoo’s case, emerge) must determine a valid set of dependencies to actually install on the system. As always, this problem can be reduced to SAT, and the insanely huge number of possible combinations (the huge configuration space) makes any naive implementation quickly blow up in running time. So this paper contributes a lazy-loading method which only looks at smaller subsets of the configuration space at a time, bringing the problem into the realm of the possible. Perhaps unintuitively, the emerge package manager is way faster than this method, but it does cheat a bit and can’t find a solution (= no installation possible) for some configurations for which this paper’s method was able to find one.
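To see why this blows up, here is a toy version of dependency resolution encoded as SAT clauses and solved by brute force. The packages and rules are invented, and real solvers are of course much smarter than enumerating every assignment:

```python
# Toy dependency resolution as SAT, nothing like the paper's lazy
# method: package names and rules are made up for illustration.
from itertools import product

packages = ["A", "B", "C"]
clauses = [
    [("A", True)],                             # A must be installed
    [("A", False), ("B", True), ("C", True)],  # A requires B or C
    [("B", False), ("C", False)],              # B conflicts with C
]

def satisfied(assignment, clause):
    # a clause holds if at least one of its literals matches
    return any(assignment[name] == wanted for name, wanted in clause)

# brute force over all 2^n install/skip choices: this is exactly
# what explodes for Gentoo, where n is in the tens of thousands
solutions = [
    dict(zip(packages, bits))
    for bits in product([False, True], repeat=len(packages))
    if all(satisfied(dict(zip(packages, bits)), c) for c in clauses)
]
print(solutions)  # install A together with exactly one of B or C
```

The lazy idea from the paper, very roughly, is to avoid ever materializing the full clause set, pulling in only the fragments of the configuration space a partial solution actually touches.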

The problems

Now that we have shown that we have actually read the papers at some point, let’s dunk on them. Have they done their homework and crafted useful, eye-pleasing software artifacts? As always: it depends.

These scandalous conclusions will shock you!

Other miscellaneous findings:


After a few more rounds of this seminar, all the lab reports will be analyzed. Each artifact is supposed to be reviewed independently by at least two students of different universities. Some form of new paper will be written about those results. At least one conference will be attended. Science done. Move on. Cite other papers. Be cited by friends and colleagues. Use your prowess to beg research associations for external funding.

As for us students: We will be cited somewhere in the fine print of some paper. Leaving a small, citable footprint for the 3% who are interested in an academic career.

  1. Useless Trivia: The very-official title of “TU Dortmund University” is an abbreviation for “Technical University Dortmund University”. ↩︎