I'm starting with GHC 4.08.2.

#+BEGIN_SRC sh
wget http://downloads.haskell.org/~ghc/4.08.2/ghc-4.08.2-src.tar.bz2
tar xf ghc-4.08.2-src.tar.bz2
#+END_SRC

The compiler's entry point is =ghc-4.08.2/ghc/compiler/main/Main.lhs=, which is a Literate Haskell source file.  Hugs can process Literate Haskell files, so I don’t need to build the =unlit= tool first.  =Main.lhs= requires pre-processing with CPP.  Hugs comes with =cpphs=, an implementation of CPP written in Haskell.  Although Hugs supports calling out to a pre-processor as it reads source files, it would be faster to pre-process all source files ahead of time.

The first problem is that GHC includes a parser grammar source file =ghc/compiler/parser/Parser.y= that needs to be processed with Happy, the Haskell parser generator.  Happy is, of course, written in Haskell.

But wait!  Hugs needs to parse Haskell source code as well, so certainly it should have a parser module, right?  Indeed!

#+BEGIN_SRC sh
cp /gnu/store/dv277478vmyxklisjgvzgbwj562ky9zs-hugs-Sep2006/lib/hugs/packages/haskell-src/Language/Haskell/Parser.hs ghc/compiler/parser/
#+END_SRC

And now we can try to run GHC on top of Hugs:

#+BEGIN_SRC sh
runhugs -F"/gnu/store/kw69dvs14ddbb2d7s8b2v20hh54v29nz-cpphs-on-hugs-1.13.3/bin/cpphs -I$PWD/ghc/compiler" -P"{Hugs}/packages/*:$PWD/ghc/compiler/*:$PWD/ghc/compiler/utils/*" ghc/compiler/main/Main.lhs
#+END_SRC

Of course, this fails.  GHC contains modules that recursively depend on one another, which Hugs does not support.  To get around this and one other problem I made the following changes:

+ merged =ghc/compiler/hsSyn/Hs{Expr,Binds,Matches}.lhs= into =ghc/compiler/hsSyn/HsExprBindsMatches.lhs=, then redefined =HsExpr.hs=, =HsBinds.hs=, and =HsMatches.hs= in terms of that bigger module, re-exporting all definitions.

+ replaced =panic#= with =panic'= and =pprPanic#= with =pprPanic'=, because Hugs doesn’t accept the =#= character (GHC’s marker for unboxed things) in identifiers.  This might lead to problems later on.

Try again, this time passing =+98= and adding the =hslibs= directories to the module search path:

#+BEGIN_SRC sh
runhugs +98 -F"/gnu/store/kw69dvs14ddbb2d7s8b2v20hh54v29nz-cpphs-on-hugs-1.13.3/bin/cpphs -I$PWD/ghc/compiler" -P"{Hugs}/packages/*:$PWD/ghc/compiler/*:$PWD/ghc/compiler/utils/*:$PWD/hslibs/lang/*:$PWD/hslibs/*" ghc/compiler/main/Main.lhs
#+END_SRC

Now I get this error:

#+BEGIN_EXAMPLE
runhugs: Error occurred
ERROR "ghc/compiler/main/CmdLineOpts.lhs" - Can't find imported module "ArrBase"
#+END_EXAMPLE

This is tricky because Hugs doesn’t have an =ArrBase= module.  What now?

Fixing this required just a small patch, but it led to a much bigger problem: =PrelGHC= does not exist.  The core of the Prelude in GHC is implemented inside the GHC runtime.  =PrelGHC.hi-boot= provides an interface, but Hugs cannot read interface files.  Even if it could, that would not help, because Hugs does not implement the primitives that =PrelGHC.hi-boot= declares.

Early versions of GHC included a modified version of Hugs to provide an interpreter.  This was before GHCi arrived.  GHC 4.08.2 includes modified Hugs sources and links them with the GHC runtime.  The GHC runtime implements the primitives and is written in C and “Haskellized-C” (files ending in =.hc=).  The compiler driver script (written in Perl) turns these =.hc= files into actual C code.

So the next step is to build the GHC runtime for version 4.08.2, then to build the modified Hugs interpreter and link it with the runtime, and finally to interpret the GHC compiler with that modified version of Hugs so that it can compile itself.

GHC 4.08.2 is sufficient to build GHC 6.0.

6.6.1 was released in April 2007.
To build 6.6 you need GHC > 5.04 (http://web.archive.org/web/20070426051520/http://hackage.haskell.org:80/trac/ghc/wiki/Building/Prerequisites) or maybe 6.0 (http://web.archive.org/web/20071224035747/http://hackage.haskell.org:80/trac/ghc/wiki/Building/Prerequisites).

+ To build 6.8.* you need GHC >= 6.4
+ To build 6.10.* you need GHC >= 6.6
+ To build 7.4.* you need GHC >= 6.12
+ To build 7.6.* you need GHC >= 7.0.1

So let’s build it like this:
4.08.2 -> 6.0 -> 6.6 -> 6.12(?) -> 7.4 -> 7.6


* building the RTS

The RTS contains =.hc= files.  Since we don’t have GHC, we cannot rely on it to turn these files into regular C code.  Instead, we have to figure out what kind of transformations need to be performed on these files.

One of the tasks is to turn =#line= pragmas into appropriate Haskell pragmas.  That’s done by =ghc/utils/hscpp/hscpp.prl=, but it doesn’t seem to be really necessary, so maybe we can skip it.  Another task appears to be the injection of the header =Stg.h=.  That’s done by the =runGcc= subroutine of =ghc.lprl=, a literate Perl script of thousands of lines.  For the RTS =.hc= files it just inserts a header include (=#include "Stg.h"=) and adds a definition of =ghc_cc_ID=, which records where the file came from, e.g.:

#+BEGIN_SRC c
    static char ghc_cc_ID[] = "@(#)cc ../../ghc/rts/HeapStackCheck.hc   .,,";
#+END_SRC

The differences between the included .C files and the .HC source files
are trivial.  (The included .C files also contain some pointer casts
that don't seem to be needed and are not generated with GCC 4.) I'm
inclined to treat these .C files as sources.

If it were just about adding the Stg.h header include I'd do this:

#+BEGIN_SRC sh
# Prepend the Stg.h include to every .hc file and write the result to a .c file.
for a in *.hc; do { echo '#include "Stg.h"'; cat "$a"; } > "${a%.hc}.c"; done
#+END_SRC
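
If the =ghc_cc_ID= line turns out to matter as well, the wrapping could be sketched in C roughly as follows.  This is only an illustration of the two additions described above, not what =runGcc= actually does: the function name =wrap_hc= is made up, the order of the two lines is an assumption, and the trailing =.,,= is copied verbatim from the =HeapStackCheck.hc= example without knowing what it encodes.

#+BEGIN_SRC c
#include <stdio.h>

/* Hypothetical helper: copy the .hc stream IN to OUT, preceded by the
   Stg.h include and a ghc_cc_ID definition that records SRC_PATH.
   Returns 0 on success, -1 on an I/O error. */
static int wrap_hc(FILE *in, FILE *out, const char *src_path)
{
    char buf[4096];
    size_t n;

    if (fputs("#include \"Stg.h\"\n", out) == EOF)
        return -1;

    /* Format modelled on the HeapStackCheck.hc example; the "   .,," tail
       is reproduced as-is (assumption: it is not significant). */
    if (fprintf(out, "static char ghc_cc_ID[] = \"@(#)cc %s   .,,\";\n",
                src_path) < 0)
        return -1;

    /* Copy the original file contents unchanged. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            return -1;

    return ferror(in) ? -1 : 0;
}
#+END_SRC

Calling =wrap_hc(in, out, "../../ghc/rts/HeapStackCheck.hc")= on an opened input and output file would then produce a =.c= file that starts with the include and the ID string.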

Another task is to mangle embedded assembly code with =ghc-asm.lprl=.
I hope that won’t be necessary.




** notes
<rekado> Is this valid C? (StgClosure *)tmp = isAlive((StgClosure *)t);
                                                                        [21:57]
<rekado> GCC doesn’t like that and says “error: lvalue required as left
         operand of assignment”  [21:58]
<rekado> (that’s from the garbage collector code of the GHC runtime system)
<`Lion> cast on an lvalue doesn't make sense  [21:59]
<rekado> there are a lot of statements like that in the code  [22:00]
<rekado> (StgClosure *)w = evacuate((StgClosure *)w);
<rekado> StgWeak *w, **last_w;
<`Lion> weird  [22:01]
<`Lion> ok so they are declared first
<rekado> I don’t know what the cast is supposed to achieve.
<`Lion> me neither
<`Lion> like i said, it doesn't make sense
<rekado> ok, I’m just going to remove them and see what happens
<rain1> isit c++ and you're compiling it as C?  [22:05]
<rekado> it’s supposed to be just C.  [22:06]
<rekado> looks like they used a version of GCC around 2.96.  [22:10]
<rekado> ugh, there’s more statements like that: { *(--stgCast(StgPtr*,gSp))
         = x; }  [22:26]
<rekado> { *(--stgCast(StgClosure**,gSp)) = x; }
<rekado> this is hard to parse for me
<`Lion> i'm guessing stgCast is a macro  [22:27]
<`Lion> it pretty much has to be... first argument is a pointer type  [22:28]
<rekado> maybe I should get an old GCC first  [22:29]
<rekado> you’re right: #define stgCast(ty,e) ((ty)(e))  [22:30]
<`Lion> yet more reasons why we need better bootstrapping chains, i guess
                                                                        [22:33]
<rekado> this code is terrible.
<rekado> it’s hundreds of files, each thousands of lines, macro-heavy, and GCC
         version specific  [22:34]
<rekado> dependent on preprocessing with Perl, GHC, and CPP
<rekado> argh!
<`Lion> nasty
<rekado> and that’s only the runtime!
<`Lion> pretty sure building an old GCC on a new system is a nightmare  [22:35]
<rekado> yeah  [22:36]
<`Lion> maybe if you're willing to get some binaries... linux (the kernel) is
        pretty good at backward compatibility, but glibc and the system
        headers and libraries.... i dunno
<rekado> it’s a pity that the GHC developers abandoned Hugs as the interpreter
         in version 5 of GHC.  From version 5 onwards they use GHCi, which
         integrates much more tightly with GHC.  [22:38]
<rekado> so I cannot begin the bootstrap path at a later version of GHC.
<rekado> it has to be 4.08.2
<rekado> with a statement like this: { *(--stgCast(StgPtr*,gSp))  = x; }
                                                                        [22:41]
<rekado> the problem is that the lvalue is cast
<rekado> how would this be written in valid C?  [22:42]
<`Lion> well, first we have to figure out what it's supposed to achieve
<rekado> gSp is taken to be an StgPtr* and the address is subtracted by the
         size of StgPtr(?); then the value is updated to be x?
<rekado> the comments say “Supporting routines for primops”  [22:43]
<`Lion> i guess gSp's type is some other pointer type?
<rekado> this is the chunk:
<rekado> http://paste.debian.net/plain/1011702
<`Lion> k
<rekado> inlined stack operations?  [22:44]
<rekado> gSp is a register
<`Lion> ok now i have to think  [22:45]
<`Lion> heh
<`Lion> so, it seems to add one pointer level to the type with each cast
                                                                        [22:46]
<`Lion> for some reason
<rekado> these operations push values of certain types to the stack; they do
         this by moving up by the size of the value (hence the cast); then at
         the new address they store the value.
<rekado> (that’s a guess)
<`Lion> right, seems like it  [22:47]
<`Lion> but pointer sizes are always the same
<rekado> I suppose the equivalent is to subtract the size of the pointer type
         from gSp and then assign: *gSp = x
<rekado> huh, right.
<rekado> (have they always been?  Were they also the same size in GCC 2.x
         times?)  [22:48]
<`Lion> i would think so, but i don't really know
<rekado> but wait: StgInt and StgWord and so on: these are not all the same
         size.  [22:49]
<`Lion> i feel like it would be more helpful if we were able to look at the
        disassembly of a binary produced from this code :/
<rekado> heh, probably
<`Lion> because then we can see what it actually ends up doing
<`Lion> yeah no doubt the base types are differently sized, but...  [22:51]
<`Lion> ok, so it's probably about the pointer dereference that happens after
        the decrement  [22:52]
<`Lion> if you could split them into two statements each  [22:53]
<`Lion> thinking...
<rekado> { gSp -= sizeof(StgWord); *gSp = x; }  [22:54]
<rekado> instead of { *(--stgCast(StgWord*,gSp)) = x; }
<`Lion> something like: --gSp; *((StgPtr *) gSp) = x;
<`Lion> hmm
<rekado> well, x is of different types, which have different sizes.  [22:55]
<`Lion> well, yeah
<rekado> so I think we need to move not by just a single step for gSp, but by
         the size of the target type
<`Lion> if it's really the value which goes on the stack
<rekado> hmm  [22:56]
<rekado> the value of whatever the new position of gSp points to is changed
<rekado> so maybe the size doesn’t matter
<rekado> bleh
<`Lion> { gSp -= sizeof(StgWord); *((StgWord *) gSp) = x; }
<rekado> guessing is no fun  [22:57]
<`Lion> i think that would make the most sense
<rekado> yes, I see
<`Lion> subtracts StgWord size from stack pointer, then puts the new StgWord
        value in the space just allocated
<`Lion> and if you want to keep using StgCast i guess it would be: { gSp -=
        sizeof(StgWord); *(StgCast(StgWord *, gSp)) = x; }  [22:58]
<rekado> yes, this looks good.  [23:00]
<rekado> this neatly separates the decrement from the assignment
<rekado> a similar issue: { return *stgCast(StgWord*,gSp)++; }  [23:02]
<rekado> that’s to return the value at gSp as a StgWord, and then increments
         gSp(?), thereby popping the value of the stack  [23:03]
<`Lion> probably supposed to be: { gSp += sizeof(StgWord); return
        *(StgCast(StgWord *, gSp)); }  [23:06]
<`Lion> one thing though
<`Lion> pointer arithmetic does depend on the size of its base type
<`Lion> e.g. if you have int32_t *x and you say x++; the address actually
        increases by 4 (size of int32_t)  [23:07]
<`Lion> so the way we add or subtract by sizeof(x) only works if gSp's base
        type is one byte long  [23:08]
<`Lion> otherwise we do have to cast it before doing the arithmetic
<`Lion> so i guess that's what they were doing everywhere  [23:09]
<`Lion> if GCC really won't accept a cast on an lvalue, you'd have to use an
        extra temporary variable  [23:12]
<`Lion> i'm looking at this right now:
        https://stackoverflow.com/questions/30797925/lvalue-required-as-increment-operand-error-with-old-c-code
                                                                        [23:21]
<`Lion> someone is saying: "In other words, the memory location storing p was
        treated as if it actually stored a pointer of another type, and then
        that pointer is incremented. This worked on those compilers because
        those compilers only ran on hardware where all pointers are stored in
        the same way."  [23:22]
<`Lion> ok, i think i know what it's supposed to be  [23:25]
<`Lion> also, i compiled: main() { int *p = 0; ++(*(long **)&p);
        printf("%p\n", p); return 0; }
<`Lion> loosely based on the example given  [23:26]
<`Lion> and it shows 0x8
<`Lion> so that works
<`Lion> how about: { *(--*stgCast(StgWord **, &gSp)) = x; }  [23:28]
*** OriansJ (~user@itsx01.pdp10.guru) has joined channel #bootstrappable
                                                                        [23:30]
<OriansJ> hopefully I didn't miss anything this morning
<rain1> hello!
<OriansJ> I've been making good progress on M2-Planet and I am preparing for a
          new release of mescc-tools this weekend  [23:31]
<`Lion> and: { return *(*stgCast(StgWord **, &gSp))++; }
<`Lion> i think you'd rather want to miss this :)
<OriansJ> so calling the function stgCast with a double indirected custom type
          and the address of another custom type. Then loading the pointer of
          the pointer of the return of the call and incrementing it.  [23:32]
<`Lion> turns out old really old GHC code only compiles with really old
        GCCs...
<OriansJ> So in short, developer never seem to bother to check to see if the
          code they write complies with any sort of standard  [23:33]
<`Lion> rekado was getting the error described here:
        https://stackoverflow.com/questions/30797925/lvalue-required-as-increment-operand-error-with-old-c-code
<`Lion> this is a chunk of the offending code:
        https://paste.debian.net/plain/1011702  [23:34]
<OriansJ> What possibly convinces people that is the sort of code you should
          ever write?  [23:36]
<OriansJ> What are they code golfing or something while writing the
          compiler????  [23:37]
<`Lion> heh  [23:41]
<`Lion> i mean
<`Lion> you know what happens when people try to 'just get it working'  [23:42]
<`Lion> and the correct version here actually looks more horrible  [23:43]
<OriansJ> It is one thing to hack on something in a private repo but they
          really need to take some pride in their work  [23:44]
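
To summarise the discussion in the log: the old code casts the /target/ of an assignment or increment, which old GCC versions accepted as an extension and current GCC rejects.  Below is a minimal sketch of the temporary-variable rewrite that `Lion mentions.  The type definitions and the =evacuate= stub are simplified stand-ins so that the example compiles on its own; they are not the real RTS definitions, and =evacuate_weak= is a made-up wrapper name.  Only the =stgCast= macro and the quoted "old" statements come from the log.

#+BEGIN_SRC c
/* Simplified stand-ins, NOT the real RTS definitions. */
typedef unsigned long StgWord;
typedef StgWord *StgPtr;
typedef struct StgClosure_ { StgWord header; } StgClosure;
typedef struct StgWeak_ { StgWord header; } StgWeak;

#define stgCast(ty, e) ((ty)(e))   /* as quoted in the log */

static StgPtr gSp;                 /* stack pointer; the stack grows downwards */

/* Stand-in for the collector's evacuate(); it just returns its argument. */
static StgClosure *evacuate(StgClosure *c) { return c; }

/* Old: (StgClosure *)w = evacuate((StgClosure *)w);
   New: cast the result instead of the assignment target. */
static StgWeak *evacuate_weak(StgWeak *w)
{
    w = stgCast(StgWeak *, evacuate(stgCast(StgClosure *, w)));
    return w;
}

/* Old: { *(--stgCast(StgWord*,gSp)) = x; } */
static void PushWord(StgWord x)
{
    StgWord *sp = stgCast(StgWord *, gSp);  /* cast a copy, not the lvalue */
    --sp;                                   /* steps down by sizeof(StgWord) */
    *sp = x;
    gSp = stgCast(StgPtr, sp);              /* write the new position back */
}

/* Old: { return *stgCast(StgWord*,gSp)++; } */
static StgWord PopWord(void)
{
    StgWord *sp = stgCast(StgWord *, gSp);
    StgWord x = *sp++;                      /* read first, then step up */
    gSp = stgCast(StgPtr, sp);
    return x;
}
#+END_SRC

For the other push and pop routines the temporary would get the pointer type named in the original cast, so that the pointer arithmetic steps by the size of that type; whether this matches what GCC 2.x generated for the original statements would still have to be checked.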


* building GHC 6.6.1

The RTS contains =.cmm= files; these are “C minus minus” sources.  GHC includes a Cmm compiler, but it depends on GHC libraries, so it cannot be interpreted with Hugs.  There is an old implementation of a [[https://github.com/nrnrnr/qc--/blob/master/INSTALL][“Quick C minus minus” compiler]], which is written in OCaml.  GHC’s Cmm is mostly a subset of C--, but I don’t know whether it adds anything of its own, so instead of wasting time packaging =qc--= (which has been unmaintained since around 2007), I’ll try to convert the Cmm source files to C files with GHC.  This will show me whether the conversion could be done manually.

Generated by Ricardo Wurmus using scpaste at Thu Jan 27 21:47:09 2022 CET.