Mining Debian Maintainer Scripts

Speakers: Ralf Treinen Nicolas Jeannerod

Track: Packaging, policy, and Debian infrastructure

Type: Long talk (45 minutes)


Room: Yushan (玉山) Live Stream

Time: Jul 31 (Tue), 10:00

Duration: 0:45

Debian sid contains more than 31.000 maintainer scripts, the vast majority of which are written in the POSIX shell language. We have written, in the context of the CoLiS project, the tool shstats which allows us to do a statistical analysis of a large corpus of shell scripts. The tool operates on concrete syntax trees of shell scripts constructed by the morbig shell parser. The morbig parser has already been presented at a devroom at FOSDEM 2018, and at the minidebconf 2018 in Hamburg. The shstats tool comes with a number of analysis modules, and it is easy to extend it by adding new modules.

In this talk we will present both the design of the analyzer tool and how it can be extended by new analysis modules, and some of the results we have obtained so far on the corpus of sid maintainer scripts. Among these are answers to questions like:

  • are recursive functions used in maintainer scripts?
  • how many variable assignments are in reality definition of constants?
  • how many shell scripts don’t do any variable assignments, besides definitions of constants?
  • how often are shell constructs like while, for, if, etc. used?
  • which UNIX commands are most commonly used in the corpus, and with which options?
  • are there any syntax errors in the arguments of complex commands like test?