Files as modules

· Allanderek's blog

#compilation #language-design

Elm has followed many other languages in equating a file with some form of encapsulation unit. In Elm this unit is a module, much as it is in Haskell; in Java it is a class. In Java it is possible to define a nested class, and in most of the ML family of languages (SML, OCaml) you can define nested modules. In particular, there is no reason why a single file need define one and only one module.

One question is: why would we equate a file with a form of encapsulation? Why would we wish to divide an application into a bunch of files in the first place? If you ask this question you get a bunch of answers, but they mostly fall into two categories: 1. some vague notion about organising and modularising your code, and 2. that it is easier to navigate whilst coding. I would argue that the first is nonsense, since no one is arguing against modularising code, only that the file system does not need to come into it. The second is potentially true, but speaks to a deficiency in your coding environment (text editor/IDE) rather than to the efficacy of using files to modularise a program's source code.

It is worth thinking about another reason to split code into multiple files: files are convenient compilation units. It is important to avoid long compilation times for various reasons, and one way to reduce them is to avoid re-compiling code that has not changed. This is basic memoisation. Files are a convenient way to do this, since you can use the modification time on the file as a very fast and decently accurate proxy for whether the file's contents have changed. Now, when you determine that a file's contents have changed and it must be recompiled, how do you determine whether other files need to be recompiled too? In some way you need a signature of the file's contents. If the signature changes then any file that depends upon it should be recompiled, but if the signature doesn't change then you shouldn't need to recompile any dependent files. Because you need a signature for a file, it makes sense to equate a file with an encapsulation unit, such as a module, since then you do not need to maintain two separate forms of encapsulation semantics, one for program modularity and one for program compilation.
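The decision described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how the Elm compiler (or any particular build tool) actually works: the `Module` record, the `interface_signature` helper and the idea of representing a module's exports as a name-to-type dictionary are all hypothetical, and only direct dependents are considered.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical record a build tool might keep for each source file."""
    name: str
    mtime: float      # modification time recorded at the last build
    signature: str    # hash of the exported interface at the last build
    dependents: list[str] = field(default_factory=list)  # direct importers


def interface_signature(exports: dict[str, str]) -> str:
    """Hash the exported names and their types, ignoring implementations."""
    canonical = "\n".join(f"{name} : {ty}" for name, ty in sorted(exports.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def modules_to_recompile(
    module: Module, new_mtime: float, new_exports: dict[str, str]
) -> list[str]:
    """Return the names of the modules that need rebuilding."""
    # Cheap proxy first: an unchanged modification time means nothing to do.
    if new_mtime == module.mtime:
        return []
    # The file itself changed, so it must be recompiled.
    to_build = [module.name]
    # Its dependents are only rebuilt if the exported interface changed.
    if interface_signature(new_exports) != module.signature:
        to_build.extend(module.dependents)
    return to_build
```

With this sketch, editing only the body of a function changes the modification time but not the signature, so only that one file is rebuilt; adding or retyping an exported definition changes the signature and pulls in the dependents as well.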

Historically, compilation times were such that you wanted a file to be no longer than around 2000 lines of code. Any longer and you were in danger of increasing your compilation times needlessly. Remember that the file is just an approximation to what needs to be re-compiled. In theory we could check every top-level definition and only re-compile those that have changed, but then the check itself begins to take up some time. However, modern computers are capable of pretty fast compilation times. One could argue that, for most projects, you could easily get away with always recompiling the entire project. Projects large enough that compiling the whole thing takes some time could arguably be said to be too large, and could benefit from being broken up into separate projects. (Note that you could, for example, have a mono-repo consisting of several 'projects'; what I mean by a project is something that must be compiled together.) Even if you do not take this view, it's clear that modern computers are capable of compiling larger projects much faster, such that files in modern projects could easily be hundreds of thousands of lines long without being a major problem for compilation times.
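To make the finer-grained alternative concrete, here is a minimal sketch of checking each top-level definition rather than each file. It assumes, purely for illustration, that a file has already been split into a dictionary mapping definition names to their source text; the function names are hypothetical.

```python
import hashlib


def definition_hashes(definitions: dict[str, str]) -> dict[str, str]:
    """Hash the source text of every top-level definition."""
    return {
        name: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for name, body in definitions.items()
    }


def changed_definitions(previous: dict[str, str], current: dict[str, str]) -> set[str]:
    """Names of definitions that are new or edited since the last build."""
    return {
        name
        for name, digest in current.items()
        if previous.get(name) != digest
    }
```

Note that every definition has to be hashed on every build, changed or not, which is exactly the cost of the check mentioned above; the file's modification time avoids that work at the price of being a coarser approximation.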

My point here is that recompilation should not be a major factor in determining the size of the modules within your project. Because of this we can have several encapsulation units within a single compilation unit, which, translated, means we can have multiple modules within a single file. In Elm we already have nested modules, where parent modules correspond to directories. We just have no syntax to nest a module within a module that itself corresponds to a file.

Okay, but is there anything wrong with forcing programmers to split up their modules into separate files? I believe that forcing this makes a particular modularisation of a program more concrete, and therefore less likely to be refactored into a more appropriate one if the initial choice turns out to be unsatisfactory.