mirror of
https://codeberg.org/guix/guix.git
synced 2026-01-25 03:55:08 -06:00
doc: cookbook: Add “Reproducible Research” chapter.
* doc/guix-cookbook.texi (Reproducible Research): New node.

Change-Id: I73d12771a2c2b5717b8f553dacae272f509a9fed
parent
e0e64be8de
commit
9da40e7bc3
1 changed file with 252 additions and 1 deletion
|
|
@@ -22,10 +22,13 @@ Copyright @copyright{} 2020 André Batista@*
|
|||
Copyright @copyright{} 2020 Christine Lemmer-Webber@*
|
||||
Copyright @copyright{} 2021 Joshua Branson@*
|
||||
Copyright @copyright{} 2022, 2023 Maxim Cournoyer@*
|
||||
Copyright @copyright{} 2023-2024 Ludovic Courtès@*
|
||||
Copyright @copyright{} 2023-2025 Ludovic Courtès@*
|
||||
Copyright @copyright{} 2023 Thomas Ieong@*
|
||||
Copyright @copyright{} 2024 Florian Pelz@*
|
||||
Copyright @copyright{} 2025 45mg@*
|
||||
Copyright @copyright{} 2023 Marek Felšöci@*
|
||||
Copyright @copyright{} 2023 Konrad Hinsen@*
|
||||
Copyright @copyright{} 2023 Philippe Swartvagher@*
|
||||
|
||||
Permission is granted to copy, distribute and/or modify this document
|
||||
under the terms of the GNU Free Documentation License, Version 1.3 or
|
||||
|
|
@@ -90,6 +93,7 @@ Manual}).
|
|||
* Advanced package management:: Power to the users!
|
||||
* Software Development:: Environments, continuous integration, etc.
|
||||
* Environment management:: Control environment
|
||||
* Reproducible Research:: A foundation for reproducible research.
|
||||
* Installing Guix on a Cluster:: High-performance computing.
|
||||
* Guix System Management:: System Management specifics.
|
||||
|
||||
|
|
@@ -210,6 +214,13 @@ Environment management
|
|||
|
||||
* Guix environment via direnv:: Setup Guix environment with direnv
|
||||
|
||||
Using Guix for Reproducible Research
|
||||
|
||||
* Setting Up the Environment:: Step 1: using `guix shell'.
|
||||
* Recording the Environment:: Step 2: using `guix describe'.
|
||||
* Ensuring Long-Term Source Code Archiving:: Step 3: Software Heritage.
|
||||
* Referencing the Software Environment:: Step 4: SWHIDs.
|
||||
|
||||
Installing Guix on a Cluster
|
||||
|
||||
* Setting Up a Head Node:: The node that runs the daemon.
|
||||
|
|
@@ -5656,6 +5667,246 @@ will have predefined environment variables and procedures.
|
|||
Run @command{direnv allow} to setup the environment for the first time.
|
||||
|
||||
|
||||
@c *********************************************************************
|
||||
@node Reproducible Research
|
||||
@chapter Using Guix for Reproducible Research
|
||||
|
||||
@cindex reproducible research
|
||||
Because it supports reproducible deployment, Guix is a solid foundation
|
||||
for @dfn{reproducible research workflows}. This chapter is targeted at
|
||||
scientists; it shows how to add Guix to one's reproducible research
|
||||
toolbox@footnote{This chapter is adapted from a
|
||||
@uref{https://hpc.guix.info/blog/2023/06/a-guide-to-reproducible-research-papers/,
|
||||
blog post published on the Guix-HPC web site in 2023}.}.
|
||||
|
||||
With Guix as the basis of your computational workflow, you can get
|
||||
what's in essence @emph{executable provenance meta-data}: it's like the
|
||||
list of package name/version pairs some provide as an appendix to their
|
||||
publication, except more precise and immediately deployable.
|
||||
|
||||
This chapter is a guide in just four steps on how to make your
|
||||
computational experiments reproducible using Guix, and how to provide
|
||||
that information in your research paper.
|
||||
|
||||
@menu
|
||||
* Setting Up the Environment:: Step 1: using `guix shell'.
|
||||
* Recording the Environment:: Step 2: using `guix describe'.
|
||||
* Ensuring Long-Term Source Code Archiving:: Step 3: Software Heritage.
|
||||
* Referencing the Software Environment:: Step 4: SWHIDs.
|
||||
@end menu
|
||||
|
||||
@node Setting Up the Environment
|
||||
@section Step 1: Setting Up the Environment
|
||||
|
||||
The first step is to identify precisely what packages you need in
|
||||
your software environment to run your computational experiment.
|
||||
|
||||
Assuming you have a Python script that uses NumPy, you can start by
|
||||
creating an environment that contains these two packages and
|
||||
running your code in that environment (@pxref{Invoking guix shell,,,
|
||||
guix, GNU Guix Reference Manual}):
|
||||
|
||||
@example
|
||||
guix shell -C python python-numpy -- python3 ./myscript.py
|
||||
@end example
|
||||
|
||||
The @code{-C} flag here (or @code{--container}) instructs @command{guix
|
||||
shell} to create that environment in an isolated container with nothing
|
||||
but the two packages you asked for. That way, if
|
||||
@command{./myscript.py} needs more than these two packages, it'll fail
|
||||
to run and you'll immediately notice. On some systems
|
||||
@code{--container} is not supported; in that case, you can resort to
|
||||
@code{--pure} instead.
|
||||
|
||||
Perhaps you'll find that you also need Pandas and add it to the
|
||||
environment:
|
||||
|
||||
@example
|
||||
guix shell -C python python-numpy python-pandas -- \
|
||||
python3 ./myscript.py
|
||||
@end example
|
||||
|
||||
If you fail to guess the name of the package (this one was easy!), try
|
||||
@code{guix search}.
|
||||
|
||||
Environments for Python, R, and similar high-level languages are
|
||||
relatively easy to set up. For C/C++ code, you may find you need many more
|
||||
packages:
|
||||
|
||||
@example
|
||||
guix shell -C gcc-toolchain cmake coreutils grep sed make -- @dots{}
|
||||
@end example
|
||||
|
||||
Or perhaps you'll find that you could just as well provide a package definition
|
||||
for your package---@pxref{Defining Packages,,, guix, GNU Guix Reference
|
||||
Manual}, to learn more on how to do that.
|
||||
|
||||
Eventually, you'll have a list of packages that satisfies your needs.
|
||||
|
||||
@quotation What if a package is missing?
|
||||
Guix and the main scientific channels provide
|
||||
@uref{https://hpc.guix.info/browse, tens of thousands of packages}.
|
||||
Yet, there's always the possibility that the one package you need is
|
||||
missing.
|
||||
|
||||
In that case, you will need to provide a definition for it
|
||||
(@pxref{Defining Packages,,, guix, GNU Guix Reference Manual}) in a
|
||||
dedicated channel of yours (@pxref{Creating a Channel,,, guix, GNU Guix
|
||||
Reference Manual}). For software in Python, R, and other high-level
|
||||
languages, most of the work can usually be automated by using
|
||||
@command{guix import} (@pxref{Invoking guix import,,, guix, GNU Guix
|
||||
Reference Manual}).
|
||||
|
||||
Join
|
||||
@uref{https://guix.gnu.org/contact/,the friendly Guix community} to get
|
||||
help!
|
||||
@end quotation
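
As a rough sketch of what such a definition could look like for a small
Python tool, consider the snippet below.  Everything specific in it is
hypothetical: the @code{(my-lab packages)} module name, the
@code{python-mytool} package, its version, its home page, its license,
and the all-zero hash, which is only a placeholder.  It also assumes the
code is published on PyPI and depends on NumPy.

@lisp
;; Hypothetical module living in a channel of yours.
(define-module (my-lab packages)
  #:use-module (guix packages)
  #:use-module (guix download)
  #:use-module (guix build-system pyproject)
  #:use-module (guix build-system python)      ;provides 'pypi-uri'
  #:use-module ((guix licenses) #:prefix license:)
  #:use-module (gnu packages python-xyz))      ;provides 'python-numpy'

(define-public python-mytool
  (package
    (name "python-mytool")                     ;hypothetical package
    (version "0.1.0")
    (source
     (origin
       (method url-fetch)
       (uri (pypi-uri "mytool" version))
       (sha256
        ;; Placeholder: replace it with the hash of the real source.
        (base32
         "0000000000000000000000000000000000000000000000000000"))))
    (build-system pyproject-build-system)
    (propagated-inputs (list python-numpy))
    (home-page "https://example.org/mytool")   ;hypothetical URL
    (synopsis "Hypothetical analysis helper")
    (description
     "This is a placeholder description for a hypothetical package.")
    (license license:expat)))
@end lisp

In practice @command{guix import} would typically generate most of
these fields, leaving you to review the result before adding it to your
channel.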
|
||||
|
||||
@node Recording the Environment
|
||||
@section Step 2: Recording the Environment
|
||||
|
||||
Now that you have that @code{guix shell} command line with a list of
|
||||
packages, the best course of action is to save it in a @emph{manifest}
|
||||
file---essentially a software bill of materials---that Guix can then
|
||||
ingest (@pxref{Writing Manifests,,, guix, GNU Guix Reference Manual}).
|
||||
The easiest way to get started is by ``translating'' your command line
|
||||
into a manifest:
|
||||
|
||||
@example
|
||||
guix shell python python-numpy python-pandas \
|
||||
--export-manifest > manifest.scm
|
||||
@end example
|
||||
|
||||
Put that manifest under version control! From there anyone can redeploy
|
||||
the software environment described by the manifest and run code in that
|
||||
environment:
|
||||
|
||||
@example
|
||||
guix shell -C -m manifest.scm -- python3 ./myscript.py
|
||||
@end example
|
||||
|
||||
Here's what @file{manifest.scm} contains:
|
||||
|
||||
@lisp
|
||||
;; What follows is a "manifest" equivalent to the command line you gave.
|
||||
;; You can store it in a file that you may then pass to any 'guix' command
|
||||
;; that accepts a '--manifest' (or '-m') option.
|
||||
|
||||
(specifications->manifest
|
||||
(list "python" "python-numpy" "python-pandas"))
|
||||
@end lisp
|
||||
|
||||
It's a code snippet that lists packages. Notice that there are no
|
||||
version numbers! Indeed, these version numbers are specified in package
|
||||
definitions, located in Guix channels. To allow others to reproduce the
|
||||
exact same environment as the one you're running, you need to @emph{pin
|
||||
Guix itself}, by capturing the current Guix channel commits with
|
||||
@command{guix describe} (@pxref{Replicating Guix,,, guix, GNU Guix
|
||||
Reference Manual}):
|
||||
|
||||
@example
|
||||
guix describe -f channels > channels.scm
|
||||
@end example
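
For illustration, the resulting file might look roughly like the sketch
below.  The commit is a made-up placeholder and the URL is an
assumption; the real file records whatever your own Guix installation
reports.  The channel introduction is meant to be the one documented
for the official @code{guix} channel; if the output of
@command{guix describe} differs from this sketch, trust that output.

@lisp
;; Sketch of a 'channels.scm' file pinning the 'guix' channel.
(list (channel
        (name 'guix)
        ;; Assumed URL; yours may point to a different mirror.
        (url "https://git.guix.gnu.org/guix.git")
        (branch "master")
        ;; Placeholder commit, for illustration only.
        (commit "0123456789abcdef0123456789abcdef01234567")
        (introduction
          (make-channel-introduction
            "9edb3f66fd807b096b48283debdcddccfea34bad"
            (openpgp-fingerprint
              "BBB0 2DDF 2CEA F6A8 0D1D  E643 A2A0 6DF2 A33A 54FA")))))
@end lisp

The @code{commit} field is what pins package versions: every package
definition is taken from that exact revision of the channel.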
|
||||
|
||||
@cindex lock files, for reproducibility
|
||||
This @code{channels.scm} file is similar in spirit to ``lock files''
|
||||
that some deployment tools employ to pin package revisions. You should
|
||||
also keep it under version control in your code, and possibly update it
|
||||
once in a while when you feel like running your code against newer
|
||||
versions of its dependencies. With this file, anyone, @emph{at any time
|
||||
and on any machine}, can now reproduce the exact same environment by
|
||||
running:
|
||||
|
||||
@example
|
||||
guix time-machine -C channels.scm -- \
|
||||
shell -C -m manifest.scm -- \
|
||||
python3 ./myscript.py
|
||||
@end example
|
||||
|
||||
In this example we rely solely on the @code{guix} channel, which
|
||||
provides the Python packages we need. Perhaps some of the packages you
|
||||
need live @uref{https://hpc.guix.info/channels,in other
|
||||
channels}---maybe @code{guix-cran} if you use R, maybe
|
||||
@code{guix-science}. That's fine: @code{guix describe} also captures
|
||||
that.
|
||||
|
||||
Of course do include a @file{README} file giving the exact command to
|
||||
run the code. Not everyone uses Guix so it can be helpful to also
|
||||
provide minimal non-Guix setup instructions: which package versions are
|
||||
used, how software is built, etc. As we have seen, such instructions
|
||||
would likely be inaccurate and inconvenient to follow at best. Yet, it
|
||||
can be a useful starting point to someone trying to recreate a
|
||||
@emph{similar} environment using different tools. It should probably be
|
||||
presented as such, with the understanding that the only way to get the
|
||||
@emph{same} environment is to use Guix.
|
||||
|
||||
@node Ensuring Long-Term Source Code Archiving
|
||||
@section Step 3: Ensuring Long-Term Source Code Archiving
|
||||
|
||||
We insisted on version control before: for the @file{manifest.scm} and
|
||||
@file{channels.scm} files, but of course also for your own code. Our
|
||||
recommendation is to have these two @file{.scm} files in the same
|
||||
repository as the code they're about.
|
||||
|
||||
Since the goal is enabling reproducibility, source code availability is
|
||||
a prime concern. Source code hosting services come and go and we don't
|
||||
want our code to vanish on a whim and render our published research work
|
||||
unverifiable. @uref{https://www.softwareheritage.org/,Software Heritage}
|
||||
(SWH for short) is @emph{the} solution for this: SWH archives public
|
||||
source code and provides unique intrinsic identifiers to refer to
|
||||
it---@uref{https://swhid.org, @dfn{SWHIDs}}.
|
||||
Guix itself is
|
||||
@uref{https://doi.org/10.1145/3641525.3663622,connected
|
||||
to SWH} to (1)@ ensure that the source code of its packages is archived,
|
||||
and (2)@ to fall back to downloading from the SWH archive should code
|
||||
vanish from its original site.
|
||||
|
||||
Once your own code is available in a public version-control repository,
|
||||
such as a Git repository on your lab's hosting service, you can ask SWH
|
||||
to archive it by going to its
|
||||
@uref{https://archive.softwareheritage.org/save/,Save Code Now}
|
||||
interface. SWH will process the request asynchronously and eventually
|
||||
you'll find your code has made it into
|
||||
@uref{https://archive.softwareheritage.org/,the archive}.
|
||||
|
||||
@node Referencing the Software Environment
|
||||
@section Step 4: Referencing the Software Environment
|
||||
|
||||
This brings us to the last step: referring to our code @emph{and}
|
||||
software environment in our beloved paper. We already have all our code
|
||||
and Guix files in the same repository, which is archived on SWH. Thanks
|
||||
to SWH, we now have a SWHID, which uniquely identifies the relevant
|
||||
revision of our code.
|
||||
|
||||
Following
|
||||
@uref{https://www.softwareheritage.org/howto-archive-and-reference-your-code/,SWH's
|
||||
own guide}, we'll pick an @code{swh:dir} kind of identifier, which
|
||||
refers to the directory of the relevant revision/commit of our
|
||||
repository, and we'll keep @emph{contextual info} for clarity---that
|
||||
includes the original URL. Putting it all together, we'll conclude our
|
||||
paper with a sentence along these lines:
|
||||
|
||||
@quotation Example
|
||||
The source code used to produce this study, as well as instructions to
|
||||
run it in the right software environment using GNU@ Guix, is archived on
|
||||
Software Heritage as
|
||||
@uref{https://archive.softwareheritage.org/swh:1:dir:cc8919d7705fbaa31efa677ce00bef7eb374fb80;origin=https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone;visit=swh:1:snp:71a4d08ef4a2e8455b67ef0c6b82349e82870b46;anchor=swh:1:rev:36fde7e5ba289c4c3e30d9afccebbe0cfe83853a,@code{swh:1:dir:cc8919d7705fbaa31efa677ce00bef7eb374fb80;origin=https://gitlab.inria.fr/lcourtes-phd/edcc-2006-redone;visit=swh:1:snp:71a4d08ef4a2e8455b67ef0c6b82349e82870b46;anchor=swh:1:rev:36fde7e5ba289c4c3e30d9afccebbe0cfe83853a}}.
|
||||
@end quotation
|
||||
|
||||
With this information, the reader can:
|
||||
|
||||
@itemize
|
||||
@item
|
||||
get the source code;
|
||||
@item
|
||||
reproduce its software environment with @code{guix time-machine} and run
|
||||
the code;
|
||||
@item
|
||||
inspect and possibly modify both the code and its environment.
|
||||
@end itemize
|
||||
|
||||
Mission accomplished!
|
||||
|
||||
@c *********************************************************************
|
||||
@node Installing Guix on a Cluster
|
||||
@chapter Installing Guix on a Cluster
|
||||
|
|
|
|||