I'm starting with GHC 4.08.2.

#+BEGIN_SRC sh
wget http://downloads.haskell.org/~ghc/4.08.2/ghc-4.08.2-src.tar.bz2
tar xf ghc-4.08.2-src.tar.bz2
#+END_SRC

The compiler's entry point is =ghc-4.08.2/ghc/compiler/main/Main.lhs=, which is a Literate Haskell source file.  Hugs can process Literate Haskell files, so I don’t need to build the =unlit= tool first.

=Main.lhs= requires pre-processing with CPPHS.  Hugs comes with an implementation of CPPHS.  Although Hugs supports calling out to a pre-processor as it reads source files, for better performance it is advisable to pre-process all source files ahead of time.

The first problem is that GHC includes a parser grammar source file =ghc/compiler/parser/Parser.y= that needs to be processed with Happy, the Haskell parser generator.  Happy is, of course, written in Haskell.

But wait!  Hugs needs to parse Haskell source code as well, so certainly it should have a parser module, right?  Indeed!

#+BEGIN_SRC sh
cp /gnu/store/dv277478vmyxklisjgvzgbwj562ky9zs-hugs-Sep2006/lib/hugs/packages/haskell-src/Language/Haskell/Parser.hs ghc/compiler/parser/
#+END_SRC

And now we can try to run GHC on top of Hugs:

#+BEGIN_SRC sh
runhugs -F"/gnu/store/kw69dvs14ddbb2d7s8b2v20hh54v29nz-cpphs-on-hugs-1.13.3/bin/cpphs -I$PWD/ghc/compiler" -P"{Hugs}/packages/*:$PWD/ghc/compiler/*:$PWD/ghc/compiler/utils/*" ghc/compiler/main/Main.lhs
#+END_SRC

Of course, this fails.  GHC contains modules that recursively depend on one another.

+ merged =ghc/compiler/hsSyn/Hs{Expr,Binds,Matches}.lhs= into =ghc/compiler/hsSyn/HsExprBindsMatches.lhs=, then redefined =HsExpr.hs=, =HsBinds.hs=, and =HsMatches.hs= in terms of that bigger module, re-exporting all definitions.
+ replaced =panic#= with =panic'= and =pprPanic#= with =pprPanic'=, because Hugs doesn’t like the unboxing character.  This might lead to problems later on.
Try again:

#+BEGIN_SRC sh
runhugs +98 -F"/gnu/store/kw69dvs14ddbb2d7s8b2v20hh54v29nz-cpphs-on-hugs-1.13.3/bin/cpphs -I$PWD/ghc/compiler" -P"{Hugs}/packages/*:$PWD/ghc/compiler/*:$PWD/ghc/compiler/utils/*:$PWD/hslibs/lang/*:$PWD/hslibs/*" ghc/compiler/main/Main.lhs
#+END_SRC

Now I get this error:

#+BEGIN_EXAMPLE
runhugs: Error occurred
ERROR "ghc/compiler/main/CmdLineOpts.lhs" - Can't find imported module "ArrBase"
#+END_EXAMPLE

This is tricky because Hugs doesn’t have an =ArrBase= module.  What now?

Fixing this required just a small patch, but it led to a much bigger problem: =PrelGHC= does not exist.  The core of the Prelude in GHC is implemented inside the GHC runtime.  =PrelGHC.hi-boot= provides an interface, but Hugs cannot understand interface files.  Even if it could, that would not help, because Hugs does not implement the primitives that =PrelGHC.hi-boot= declares.

Early versions of GHC included a modified version of Hugs to provide an interpreter.  This was before GHCi arrived.  GHC 4.08.2 includes modified Hugs sources and links them with the GHC runtime.  The GHC runtime implements the primitives and is written in C and “Haskellized-C” (files ending in =.hc=).  The compiler driver script (written in Perl) turns these =.hc= files into actual C code.

So the next step is to build the GHC runtime for version 4.08.2, then to build the modified Hugs interpreter and link it with the runtime, and finally to use that modified Hugs to interpret the GHC compiler so that it can compile itself.

GHC 4.08.2 is sufficient to build GHC 6.0.  6.6.1 was released in April 2007.
+ To build 6.6 you need GHC > 5.04 (http://web.archive.org/web/20070426051520/http://hackage.haskell.org:80/trac/ghc/wiki/Building/Prerequisites) or maybe 6.0 (http://web.archive.org/web/20071224035747/http://hackage.haskell.org:80/trac/ghc/wiki/Building/Prerequisites)
+ To build 6.8.* you need GHC >= 6.4
+ To build 6.10.* you need GHC >= 6.6
+ To build 7.4.* you need GHC >= 6.12
+ To build 7.6.* you need GHC >= 7.0.1

So let’s build it like this: 4.08.2 -> 6.0 -> 6.6 -> 6.12(?) -> 7.4 -> 7.6

* building the RTS

The RTS contains =.hc= files.  Since we don’t have GHC, we cannot rely on it to turn these files into regular C code; we have to figure out which transformations need to be performed on these files.

One of the tasks is to turn =#line= pragmas into appropriate Haskell pragmas.  That’s done by =ghc/utils/hscpp/hscpp.prl=, but it also seems like something that is really not necessary, so maybe we can skip it.

Another task appears to be the injection of the header =Stg.h=.  That’s done by the =runGcc= subroutine of =ghc.lprl=, a literate Perl script of thousands of lines.  For the RTS =.hc= files it just inserts a header include (=#include "Stg.h"=) and adds a definition of =ghc_cc_ID=, which records where the file came from, e.g.:

#+BEGIN_SRC c
static char ghc_cc_ID[] = "@(#)cc ../../ghc/rts/HeapStackCheck.hc .,,";
#+END_SRC

The differences between the included .C files and the .HC source files are trivial.  (The included .C files also contain some pointer casts that don't seem to be needed and are not generated with GCC 4.)  I'm inclined to treat these .C files as sources.

If it were just about adding the Stg.h header include I'd do this:

#+BEGIN_SRC sh
# Prepend the Stg.h include to every .hc file and write the result as a .c file.
for a in *.hc; do
  { echo '#include "Stg.h"'; cat "$a"; } > "${a%.hc}.c"
done
#+END_SRC

Another task is to mangle embedded assembly code with =ghc-asm.lprl=.  I hope that won’t be necessary.

** notes

<rekado> Is this valid C?
(StgClosure *)tmp = isAlive((StgClosure *)t); [21:57]
<rekado> GCC doesn’t like that and says “error: lvalue required as left operand of assignment” [21:58]
<rekado> (that’s from the garbage collector code of the GHC runtime system)
<`Lion> cast on an lvalue doesn't make sense [21:59]
<rekado> there are a lot of statements like that in the code [22:00]
<rekado> (StgClosure *)w = evacuate((StgClosure *)w);
<rekado> StgWeak *w, **last_w;
<`Lion> weird [22:01]
<`Lion> ok so they are declared first
<rekado> I don’t know what the cast is supposed to achieve.
<`Lion> me neither
<`Lion> like i said, it doesn't make sense
<rekado> ok, I’m just going to remove them and see what happens
<rain1> isit c++ and you're compiling it as C? [22:05]
<rekado> it’s supposed to be just C. [22:06]
<rekado> looks like they used a version of GCC around 2.96. [22:10]
<rekado> ugh, there’s more statements like that: { *(--stgCast(StgPtr*,gSp)) = x; } [22:26]
<rekado> { *(--stgCast(StgClosure**,gSp)) = x; }
<rekado> this is hard to parse for me
<`Lion> i'm guessing stgCast is a macro [22:27]
<`Lion> it pretty much has to be... first argument is a pointer type [22:28]
<rekado> maybe I should get an old GCC first [22:29]
<rekado> you’re right: #define stgCast(ty,e) ((ty)(e)) [22:30]
<`Lion> yet more reasons why we need better bootstrapping chains, i guess [22:33]
<rekado> this code is terrible.
<rekado> it’s hundreds of files, each thousands of lines, macro-heavy, and GCC version specific [22:34]
<rekado> dependent on preprocessing with Perl, GHC, and CPP
<rekado> argh!
<`Lion> nasty
<rekado> and that’s only the runtime!
<`Lion> pretty sure building an old GCC on a new system is a nightmare [22:35]
<rekado> yeah [22:36]
<`Lion> maybe if you're willing to get some binaries... linux (the kernel) is pretty good at backward compatibility, but glibc and the system headers and libraries.... i dunno
<rekado> it’s a pity that the GHC developers abandoned Hugs as the interpreter in version 5 of GHC. From version 5 onwards they use GHCi, which integrates much more tightly with GHC. [22:38]
<rekado> so I cannot begin the bootstrap path at a later version of GHC.
<rekado> it has to be 4.08.2
<rekado> with a statement like this: { *(--stgCast(StgPtr*,gSp)) = x; } [22:41]
<rekado> the problem is that the lvalue is cast
<rekado> how would this be written in valid C? [22:42]
<`Lion> well, first we have to figure out what it's supposed to achieve
<rekado> gSp is taken to be an StgPtr* and the address is subtracted by the size of StgPtr(?); then the value is updated to be x?
<rekado> the comments say “Supporting routines for primops” [22:43]
<`Lion> i guess gSp's type is some other pointer type?
<rekado> this is the chunk:
<rekado> http://paste.debian.net/plain/1011702
<`Lion> k
<rekado> inlined stack operations? [22:44]
<rekado> gSp is a register
<`Lion> ok now i have to think [22:45]
<`Lion> heh
<`Lion> so, it seems to add one pointer level to the type with each cast [22:46]
<`Lion> for some reason
<rekado> these operations push values of certain types to the stack; they do this by moving up by the size of the value (hence the cast); then at the new address they store the value.
<rekado> (that’s a guess)
<`Lion> right, seems like it [22:47]
<`Lion> but pointer sizes are always the same
<rekado> I suppose the equivalent is to subtract the size of the pointer type from gSp and then assign: *gSp = x
<rekado> huh, right.
<rekado> (have they always been? Were they also the same size in GCC 2.x times?) [22:48]
<`Lion> i would think so, but i don't really know
<rekado> but wait: StgInt and StgWord and so on: these are not all the same size. [22:49]
<`Lion> i feel like it would be more helpful if we were able to look at the disassembly of a binary produced from this code :/
<rekado> heh, probably
<`Lion> because then we can see what it actually ends up doing
<`Lion> yeah no doubt the base types are differently sized, but... [22:51]
<`Lion> ok, so it's probably about the pointer dereference that happens after the decrement [22:52]
<`Lion> if you could split them into two statements each [22:53]
<`Lion> thinking...
<rekado> { gSp -= sizeof(StgWord); *gSp = x; } [22:54]
<rekado> instead of { *(--stgCast(StgWord*,gSp)) = x; }
<`Lion> something like: --gSp; *((StgPtr *) gSp) = x;
<`Lion> hmm
<rekado> well, x is of different types, which have different sizes. [22:55]
<`Lion> well, yeah
<rekado> so I think we need to move not by just a single step for gSp, but by the size of the target type
<`Lion> if it's really the value which goes on the stack
<rekado> hmm [22:56]
<rekado> the value of whatever the new position of gSp points to is changed
<rekado> so maybe the size doesn’t matter
<rekado> bleh
<`Lion> { gSp -= sizeof(StgWord); *((StgWord *) gSp) = x; }
<rekado> guessing is no fun [22:57]
<`Lion> i think that would make the most sense
<rekado> yes, I see
<`Lion> subtracts StgWord size from stack pointer, then puts the new StgWord value in the space just allocated
<`Lion> and if you want to keep using StgCast i guess it would be: { gSp -= sizeof(StgWord); *(StgCast(StgWord *, gSp)) = x; } [22:58]
<rekado> yes, this looks good. [23:00]
<rekado> this neatly separates the decrement from the assignment
<rekado> a similar issue: { return *stgCast(StgWord*,gSp)++; } [23:02]
<rekado> that’s to return the value at gSp as a StgWord, and then increments gSp(?), thereby popping the value of the stack [23:03]
<`Lion> probably supposed to be: { gSp += sizeof(StgWord); return *(StgCast(StgWord *, gSp)); } [23:06]
<`Lion> one thing though
<`Lion> pointer arithmetic does depend on the size of its base type
<`Lion> e.g. if you have int32_t *x and you say x++; the address actually increases by 4 (size of int32_t) [23:07]
<`Lion> so the way we add or subtract by sizeof(x) only works if gSp's base type is one byte long [23:08]
<`Lion> otherwise we do have to cast it before doing the arithmetic
<`Lion> so i guess that's what they were doing everywhere [23:09]
<`Lion> if GCC really won't accept a cast on an lvalue, you'd have to use an extra temporary variable [23:12]
<`Lion> i'm looking at this right now: https://stackoverflow.com/questions/30797925/lvalue-required-as-increment-operand-error-with-old-c-code [23:21]
<`Lion> someone is saying: "In other words, the memory location storing p was treated as if it actually stored a pointer of another type, and then that pointer is incremented. This worked on those compilers because those compilers only ran on hardware where all pointers are stored in the same way." [23:22]
<`Lion> ok, i think i know what it's supposed to be [23:25]
<`Lion> also, i compiled: main() { int *p = 0; ++(*(long **)&p); printf("%p\n", p); return 0; }
<`Lion> loosely based on the example given [23:26]
<`Lion> and it shows 0x8
<`Lion> so that works
<`Lion> how about: { *(--*stgCast(StgWord **, &gSp)) = x; } [23:28]
*** OriansJ (~user@itsx01.pdp10.guru) has joined channel #bootstrappable [23:30]
<OriansJ> hopefully I didn't miss anything this morning
<rain1> hello!
<OriansJ> I've been making good progress on M2-Planet and I am preparing for a new release of mescc-tools this weekend [23:31]
<`Lion> and: { return *(*stgCast(StgWord **, &gSp))++; }
<`Lion> i think you'd rather want to miss this :)
<OriansJ> so calling the function stgCast with a double indirected custom type and the address of another custom type. Then loading the pointer of the pointer of the return of the call and incrementing it. [23:32]
<`Lion> turns out old really old GHC code only compiles with really old GCCs...
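
A note on the exchange above.  Using a cast as an lvalue was a GCC extension that was removed in GCC 4.0, which is why a current GCC rejects these statements.  For plain assignments such as =(StgClosure *)w = evacuate(...)= it should be enough to cast the right-hand side to the variable's type instead.  For the stack operations, standard C can go through a temporary pointer, as suggested in the chat.  The following is only a minimal sketch under assumptions, not the RTS code: =StgWord=, =StgPtr=, =stack=, =gSp=, =PushWord= and =PopWord= are stand-in definitions made up for illustration, and the real =gSp= may well have a different type.  It moves the pointer by one element rather than by =sizeof(StgWord)=, because pointer arithmetic already scales by the element size.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the RTS types (assumption: the real ones live in the GHC headers). */
typedef uintptr_t StgWord;
typedef StgWord  *StgPtr;

/* A toy stack that grows downwards; gSp points at the top element. */
#define STACK_WORDS 16
static StgWord stack[STACK_WORDS];
static StgPtr  gSp = stack + STACK_WORDS;   /* empty stack */

/* Old form: { *(--stgCast(StgWord*,gSp)) = x; } */
static void PushWord(StgWord x)
{
    StgWord *p = (StgWord *)gSp;   /* cast the value, not the lvalue */
    assert(p > stack);             /* refuse to overflow the toy stack */
    --p;                           /* moves by one StgWord, not one byte */
    *p = x;
    gSp = (StgPtr)p;               /* write the new stack pointer back */
}

/* Old form: { return *stgCast(StgWord*,gSp)++; } */
static StgWord PopWord(void)
{
    StgWord *p = (StgWord *)gSp;
    assert(p < stack + STACK_WORDS);   /* refuse to pop an empty stack */
    StgWord x = *p++;                  /* read the top element, then step past it */
    gSp = (StgPtr)p;
    return x;
}
```

Pushing 1 and then 2 and popping twice yields 2 and then 1, and leaves =gSp= back at the end of =stack=.  For other element types the same pattern applies with the temporary pointer declared at the type being pushed or popped.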
<OriansJ> So in short, developer never seem to bother to check to see if the code they write complies with any sort of standard [23:33]
<`Lion> rekado was getting the error described here: https://stackoverflow.com/questions/30797925/lvalue-required-as-increment-operand-error-with-old-c-code
<`Lion> this is a chunk of the offending code: https://paste.debian.net/plain/1011702 [23:34]
<OriansJ> What possibly convinces people that is the sort of code you should ever write? [23:36]
<OriansJ> What are they code golfing or something while writing the compiler???? [23:37]
<`Lion> heh [23:41]
<`Lion> i mean
<`Lion> you know what happens when people try to 'just get it working' [23:42]
<`Lion> and the correct version here actually looks more horrible [23:43]
<OriansJ> It is one thing to hack on something in a private repo but they really need to take some pride in their work [23:44]

# building GHC 6.6.1

The RTS contains =.cmm= files; these are “C minus minus” sources.  GHC includes a Cmm compiler, but it depends on GHC libraries, so it cannot be interpreted with Hugs.

There is an old implementation of a [[https://github.com/nrnrnr/qc--/blob/master/INSTALL][“Quick C minus minus” compiler]], which is written in OCaml.  GHC’s Cmm is a subset of that, but I don’t know if it adds anything to C--, so instead of wasting time packaging =qc--= (which has been unmaintained since around 2007), I’ll try to convert the Cmm source files to C files with GHC.  This will show me whether this conversion could be done manually.

# 2022 edition

After taking a break for some years I revisited this problem.

I built GHC 4.08.2 with GCC 2.95, which is from the same era and has no problem with the odd C code.  I submitted a [preliminary patch](https://issues.guix.gnu.org/53609) to add a `ghc-4` package to Guix, but I was suspicious: this was too easy.  And I was right: the resulting package [didn't come with a standard library](https://logs.guix.gnu.org/bootstrappable/2022-01-29.log#105601).
I missed this the first time, because this old build system just keeps on going when it encounters actually fatal errors.

At least the RTS was built successfully, so I decided to resume the original plan: combine the GHC RTS with the Hugs interpreter, and then attempt to interpret the compiler to compile itself.

There were a couple of boring problems like [linker problems](https://logs.guix.gnu.org/bootstrappable/2022-02-02.log#121134), but I eventually managed to build a broken version of STGhugs.  It failed to load the Prelude, because the glibc package I used was built without support for shared libraries.

Once I had rebuilt glibc 2.2.5 with shared library support I was able to build STGhugs and have it load the Prelude (and pretty much any other included library) without crashing.  (This Hugs doesn't come with `runhugs`, which is a little inconvenient, but in the larger scheme of things it's probably not a real obstacle.)

```
$ ./hugs
__   __ __  __  ____   ___      _________________________________________
||   || ||  || ||  || ||__      STGHugs: Based on the Haskell 98 standard
||___|| ||__|| ||__||  __||     Copyright (c) 1994-2000
||---||         ___||           World Wide Web: http://haskell.org/hugs
||   ||                         Report bugs to: hugs-bugs@haskell.org
||   || Version: STGHugs-000425 _________________________________________

Haskell 98 mode: Restart with command line option -98 to enable extensions
Standalone mode: Restart with command line +c for combined mode

Reading source file "/tmp/guix-build-ghc-4.08.2.drv-0/ghc-4.08.2/ghc/interpreter/lib/Prelude.hs"
Reading source file "/tmp/guix-build-ghc-4.08.2.drv-0/ghc-4.08.2/ghc/interpreter/lib/PrelPrim.hs"

Hugs session for:
PrelPrim
Prelude
Type :? for help
Prelude>
```

Yay!

Loading the compiler sources fails because they include preprocessor instructions.  So we need to set up a preprocessor first.  Luckily, we've already built `hscpp`.
We configure the preprocessor and its include path:

```
Prelude> :set +F"/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/hscpp -I/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler -I/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/includes -D__HUGS__"
```

We also set the module search path:

```
Prelude> :set +P"/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/lib/std:/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/main/:/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/absCSyn:/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/utils:/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/types:/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/hslibs/lang"
```

And then try to load parts of the compiler:

```
Prelude> :load AbsCSyn
:load AbsCSyn
Reading source file "/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/absCSyn/AbsCSyn.lhs"
Reading source file "/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/utils/BitSet.lhs"
Reading source file "./Word.lhs"
Reading source file "./Numeric.lhs"
Reading source file "./Char.lhs"
Reading source file "./Ratio.lhs"
Reading source file "./Bits.lhs"
Reading source file "/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/types/TyCon.lhs"
Reading source file "/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/ghc/compiler/utils/Outputable.lhs"
Reading source file "/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/hslibs/lang/Foreign.lhs"
Reading source file "./IOExts.lhs"
Reading source file "./IO.lhs"
Reading source file "./Monad.lhs"
Reading source file "./ST.lhs"
Reading source file "/gnu/store/471qh0h8jpaq3m8kw861dr8ii19f5r8g-ghc-4.08.2/lib/hslibs/lang/Storable.lhs"
Parsing
ERROR "Storable.lhs" (line 690): Syntax error in module definition (unexpected comma)
```

This is stupid: `Storable.lhs` (and other files) contain syntax errors and Hugs is picky about those.  That file contains some lines with a trailing comma followed by another comma on the next line.  This needs patching...

We also need to patch a whole bunch of other files that use identifiers with a trailing `#`, which GHC uses for its internal values.  We replace the `#` with `'` or remove it.

Hugs also doesn't like the "foreign import" lines in `Storable.lhs` that contain the word "unsafe".  Hugs is rather cautious.

Ultimately, I feel that we might be doomed because we cannot use the "combined mode", which requires compiled Haskell code.  We're stuck with standalone mode.

My goal is to load at least part of the compiler sources and then try to cobble together a working compiler.

Is this all worth the effort when we could get generated C files for the hslibs and the compiler sources here[1]...?  Using those would still be slightly better than depending on a big GHC binary, no...?

[1]: http://downloads.haskell.org/~ghc/4.08.2/ghc-4.08.2-hc-unreg.tar.bz2 or these: http://downloads.haskell.org/~ghc/4.08.2/ghc-4.08.2-x86-hc.tar.bz2