My Gentoo make.conf Explained: USE Flags, CFLAGS, and Beyond

Gentoo’s power lies in its flexibility – by tuning your system’s configuration, you can squeeze out extra performance and tailor the OS for specific needs. In this post, I’ll walk through an optimized /etc/portage/make.conf for my development workstation and explain each part. This machine is an Intel Core i7-14700K (Raptor Lake) running Gentoo with a custom kernel and the Xen hypervisor. The focus is on AI/ML workloads, so we’ll cover everything from CPU optimization flags and USE flags for math libraries, to portage tweaks for parallel builds and virtualization-friendly settings. The tone here is humble and practical – this isn’t the one “perfect” config for everyone, but it’s a setup that works well for an advanced Gentoo box doing heavy machine learning tasks.

Complete make.conf Example

First, here’s the complete make.conf for reference. We’ll break down each section below:

# CPU and Compiler Settings
CHOST="x86_64-pc-linux-gnu"
COMMON_FLAGS="-march=raptorlake -O2 -pipe"
CFLAGS="${COMMON_FLAGS}"
CXXFLAGS="${COMMON_FLAGS}"
FCFLAGS="${COMMON_FLAGS}"
FFLAGS="${COMMON_FLAGS}"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"

# Use Flags
USE="cuda opencl openmp blas lapack xen -gtk -qt5 -kde -gnome"
CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt rdrand sha sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx"

# Parallel Build Options
MAKEOPTS="-j28"
EMERGE_DEFAULT_OPTS="--jobs=2 --load-average=28"

# Portage Features and advanced options
FEATURES="ccache parallel-fetch"
VIDEO_CARDS="nvidia"
GRUB_PLATFORMS="efi-64 xen"
# Limit QEMU targets to reduce build time (for Xen toolstack)
QEMU_SOFTMMU_TARGETS="x86_64"
QEMU_USER_TARGETS="x86_64"

Now, let’s go through each of these settings, understand what they do, and see why they’re configured this way.

CPU Architecture and CFLAGS

CHOST – This is the target triplet for the system. Here it’s x86_64-pc-linux-gnu, which is standard for 64-bit Gentoo on Intel/AMD CPUs. This is set by the stage3 and rarely needs changing for native installs.

CFLAGS and CXXFLAGS – These flags tell the compiler how to optimize code. We define a COMMON_FLAGS variable to avoid repetition, then use it for both CFLAGS (for C code) and CXXFLAGS (for C++). Let’s break down the flags:

-march=raptorlake – This flag optimizes generated code specifically for the Intel Raptor Lake CPU family (which the i7-14700K belongs to). GCC 13+ added explicit support for Raptor Lake. By using -march=raptorlake, the compiler will enable all instructions and optimizations available on this CPU (e.g. SSE4.2, AVX2, FMA3, AES-NI, etc.) and tune the code for its microarchitecture. This is similar to using -march=native (which auto-detects the local CPU) but we specify it explicitly for clarity. The trade-off is that our binaries will only run on this CPU (or others with the same capabilities), but in exchange they run faster by using modern instructions.
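If you want to verify what a given -march value actually turns on, GCC itself can tell you. Two quick checks I find handy (the grep patterns are only examples; -march=raptorlake needs GCC 13 or newer):

# Show what -march=native resolves to on this machine
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1

# List the target options GCC enables for -march=raptorlake
gcc -march=raptorlake -Q --help=target | grep -E '^ *-m(arch|avx2|fma)[ =]'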

-O2 – This is the optimization level. -O2 enables a good balance of optimizations that improve performance without increasing compile-time or binary size too much. Gentoo uses -O2 by default because it’s a safe choice. Using -O3 is tempting for maximum speed, but it enables more aggressive optimizations that sometimes break compilation or even produce buggy binaries. In practice, -O3 can yield performance gains in heavy numerical code, but a few packages are known to misbehave under -O3. For a stable system, I stick with -O2 globally, and if needed I can enable -O3 selectively for certain math libraries (using per-package settings, which we’ll discuss later). The Gentoo Wiki notes that -O3 “is not recommended unless running test suites for packages” due to potential breakages – good advice to heed unless you enjoy debugging random issues.

-pipe – This flag uses memory pipes rather than temporary files for communication between the stages of compilation. It doesn’t affect the generated code or runtime performance, but it can speed up compile times and reduce disk I/O during emerges. It’s basically free performance for compilation, so it’s almost always enabled.

We also apply the same COMMON_FLAGS to Fortran compiler flags (FCFLAGS and FFLAGS). This is important for scientific computing and machine learning, because many math libraries (BLAS/LAPACK, SciPy, R, etc.) use Fortran. By setting these, any Fortran code will also be built with -march=raptorlake -O2 -pipe, ensuring optimized numerical routines.

LDFLAGS – These flags are passed to the linker. We set -Wl,-O1 -Wl,--as-needed, which are Gentoo’s default recommendations. -Wl,-O1 tells the linker to apply its own optimizations (mainly tuning its internal hash tables, which speeds up loading), and --as-needed means don’t link in libraries that aren’t actually used by the program. This helps reduce bloat and startup time for executables by dropping unnecessary shared library dependencies. Gentoo’s base profile already sets safe LDFLAGS like these, so we’re just being explicit. There’s usually no need to go beyond this (aggressive LDFLAGS rarely give additional benefit and can cause linking problems). If you experiment with LLVM’s lld linker or Link-Time Optimization (LTO) system-wide, those would also tie into LDFLAGS – but that’s a topic for another day (and a very experimental path on a large Gentoo system!).
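To spot-check the effect of --as-needed, app-misc/pax-utils ships scanelf, which prints the shared libraries a binary actually records as NEEDED (the binary path below is just an example):

scanelf --needed /usr/bin/bzip2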

Justification

All these compiler-related flags ensure that our system is tuned for the i7-14700K CPU and stable in daily use. By targeting Raptor Lake, we get support for advanced instruction sets (like AVX2 and FMA) which can significantly speed up math-heavy workloads. For instance, vectorized instructions can accelerate linear algebra operations that are ubiquitous in AI/ML code. We avoid overly risky optimizations (like -Ofast, which is even more aggressive than -O3 but breaks strict standards compliance and is unsafe globally). The goal is to get maximum performance while retaining stability – a failed overnight compile due to an extreme flag isn’t worth a tiny speedup in a matrix multiply. This balanced approach is generally recommended by Gentoo developers, who filter out unsafe flags in ebuilds for the same reasons.

Global USE Flags

The USE variable in make.conf sets global USE flags, which enable or disable optional features across packages. Tuning USE flags is one of Gentoo’s greatest strengths – you only build what you need. Here’s what we have and why:

cuda and opencl – These enable NVIDIA CUDA and OpenCL support globally. Since this machine is used for machine learning with a GPU (an NVIDIA card), I want any packages that can leverage GPU acceleration to do so. For example, enabling cuda will ensure that packages like sci-libs/opencv, sci-libs/boinc, or potentially deep learning libraries will build with CUDA support if available. The opencl flag similarly enables OpenCL support (useful for general GPU compute, and also for AMD GPUs or Intel’s oneAPI). By setting these globally, I don’t have to remember to turn them on for each relevant package – Portage will pull in the CUDA toolkit and any needed dependencies automatically when a package has a cuda USE flag.

openmp – This enables OpenMP support globally, which allows parallel CPU threading in many numeric libraries. AI/ML and scientific packages (like SciPy, OpenCV, BLAS libraries, etc.) often have OpenMP-based parallelism for multi-core scaling. With openmp in USE, packages that can use OpenMP will be built with support for it. This is key for making full use of the 20-core/28-thread CPU during heavy computations.

blas and lapack – Strictly speaking, these (and related flags like cblas) aren’t universal global USE flags – there’s no single “blas” flag that every package honors – but I list them here for clarity because we definitely want BLAS/LAPACK support in our scientific libraries. Gentoo handles BLAS/LAPACK via virtual packages (virtual/blas, virtual/lapack) and allows various implementations (reference BLAS, OpenBLAS, BLIS, Intel MKL, etc.). In our case, I want an optimized BLAS/LAPACK for AI workloads. There are two main choices for Intel CPUs:

OpenBLAS (open source, very fast on Intel for large matrices, and generally competitive with Intel’s MKL).

Intel MKL (closed-source, free-to-use runtime available as sci-libs/mkl-rt in Gentoo, often the fastest on Intel hardware, particularly for small matrix sizes where its hand-tuned kernels shine – the same niche that libraries like libxsmm target).

I’ve chosen to use OpenBLAS on this system for now (to keep things fully open source and easy to manage). To enforce that, I set the openblas USE flag where applicable. For example, Gentoo’s deep learning package (sci-libs/caffe2, which is the basis of PyTorch) has an openblas USE flag. By enabling that (and not enabling mkl), I ensure it will use OpenBLAS as the BLAS provider. Similarly, some packages like OpenCV have USE flags mkl or openblas to choose the math backend. If I wanted MKL instead, I could swap and use mkl USE flag globally (and Gentoo would pull in MKL). The bottom line is to choose one high-performance BLAS/LAPACK and stick with it. OpenBLAS is a great default that yields excellent performance on Intel hardware without the licensing fuss of MKL. We also ensure linear algebra is threaded (OpenBLAS can be built with openmp or pthread threading – here it will pick up OpenMP threads because we enabled that globally).
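On disk, pinning that choice can look like this – a sketch; exact flag names vary by package and ebuild version:

# /etc/portage/package.use/blas
sci-libs/caffe2 openblas -mkl
media-libs/opencv openblas -mkl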

Tip: Gentoo now supports BLAS/LAPACK runtime switching (if OpenBLAS is built with eselect-ldso USE). This means you could have multiple BLAS implementations installed and switch between them using eselect blas and eselect lapack without rebuilding packages. It’s handy for benchmarking MKL vs OpenBLAS, for example. On a production system, though, I usually just stick to one to avoid confusion.
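In practice, runtime switching looks like this (provider names come from the list command; this assumes OpenBLAS was built with USE=eselect-ldso):

eselect blas list
eselect blas set openblas
eselect lapack set openblas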

xen – I enable the global xen USE flag because this machine is running the Xen hypervisor (as dom0). The xen USE flag triggers Xen support in various packages. For instance, some system tools and libraries have optional Xen integration. Enabling xen globally ensures that if a package can be built with Xen support, it will be. A concrete example is dev-lang/mono: it has a special xen flag that makes the Mono runtime generate code optimized for Xen virtual machines (which can make .NET code run faster under Xen at the slight expense of performance on bare metal). Since my system is always under Xen (dom0 is itself a VM in Xen terms, albeit a privileged one), this is beneficial. Another example is app-emulation/libvirt, which uses USE="xen" to enable the Xen driver support. Global Xen USE is a niche flag that most users won’t need, but in a Xen environment it’s very useful for pulling in all relevant support.

Minimal GUI – I have -gtk -qt5 -kde -gnome in USE to disable big desktop environments and toolkits on this machine. This box is meant for development and virtual machines, and I primarily use a minimal Xfce or console, so I don’t want builds accidentally pulling in GTK+ or Qt unless absolutely necessary. By masking these globally, I reduce bloat (for example, some packages have optional GUI tools – they won’t be built). This reflects the principle of only include what you need. (If you’re running a full desktop on Gentoo, you’d of course omit these negatives and enable the ones you use.)

To summarize, these global USE flags ensure machine learning friendliness (GPU computing and math libraries), virtualization awareness (Xen), and keep the system lean by avoiding unwanted features. Remember that USE flags can also be set per-package in /etc/portage/package.use for fine-tuning, but anything in make.conf is a convenient default for the whole system.

CPU_FLAGS_X86 – Leveraging CPU Instructions

While -march=raptorlake instructs the compiler to generate optimized code, Gentoo also uses CPU_FLAGS_X86 to enable hand-written assembly optimizations in certain packages. This variable (part of the USE_EXPAND system) lists specific CPU instruction set extensions present on your CPU. You can think of it as telling packages “here is exactly what my CPU supports.” For example, multimedia libraries (like FFmpeg, image codecs, cryptography libraries, etc.) often have hand-tuned assembly or intrinsics for SSE2, AVX, AES-NI, etc. They key off CPU_FLAGS_X86 to decide which optimizations to enable.

On my i7-14700K, I set:

CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt rdrand sha sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx"

That looks like alphabet soup, but each token is a feature bit of the CPU:

sse, sse2, sse3, ssse3, sse4_1, sse4_2 – Streaming SIMD Extensions (various generations). These are the foundational vector instruction sets. Virtually all modern software can use SSE2, but by listing them, we ensure packages that check for them get a “yes”. For example, image processing might use SSE2 for color space conversions, or video encoders might use SSE4.1 for certain filters.

avx, avx2 – Advanced Vector Extensions 256-bit. AVX2 is a big one for numerical work; it allows handling eight 32-bit floats or four 64-bit floats in one instruction (versus SSE which handles at most 128-bit vectors). Libraries like NumPy, OpenBLAS, and many others will use AVX/AVX2 kernels for significant speedups. By setting these flags, we let Gentoo packages enable AVX2 code paths. For instance, OpenBLAS will compile its optimized kernels for DGEMM (matrix multiply) with AVX/FMA if available. Many video encoders will use AVX2 for motion estimation, etc.

fma3 (FMA) – Fused Multiply-Add instructions (part of AVX2 on Intel Haswell and newer). This is great for linear algebra and neural network computations because it can do a = a + b*c in one step with reduced rounding error. We include this so any library that can use FMA will do so (e.g., Eigen, BLIS, and others detect it).

f16c – Half-precision (16-bit float) conversion instructions. Relevant in ML if using half precision floats for faster compute (common in GPU code, but even CPU libraries may use it for things like compressing float arrays or BF16 emulation). By advertising F16C, packages like OpenCV or TensorFlow (if compiled from source) can utilize it.

aes – AES-NI (AES encryption instructions). Not directly ML-related, but if you do any disk encryption or SSL, this drastically speeds up AES cipher operations. We include it so that openssl, cryptsetup, etc., use hardware AES. It’s a no-brainer on modern CPUs with AES-NI.

pclmul (PCLMULQDQ) – Polynomial Carry-less Multiply, used for fast Galois Field multiplications (often paired with AES for GCM mode encryption). Again, more crypto than ML, but it comes for free with the CPU and helps anything doing CRCs or GCM encryption.

popcnt – Population count instruction, counts bits set to 1. This is used in certain algorithms (e.g., neural network binary masks, cryptography, etc.).

rdrand – Hardware random number generator. Useful for quick entropy (e.g., seeding RNGs for Monte Carlo simulations in ML or cryptographic keys). Intel’s RDRAND is exposed via this flag.

sha – SHA1/SHA256 hashing instructions. This CPU can accelerate SHA hashing, which can be useful if you’re, say, checksumming lots of data or training models that involve hashing (some locality-sensitive hashing schemes, etc.). Mainly though, it helps cryptography and even verifying Gentoo distfiles faster.

mmx, mmxext – Old but still included for completeness. MMX is an old SIMD set; mmxext refers to AMD’s extended MMX instructions (Intel CPUs also support them as part of SSE). Many media codecs still have MMX fallbacks, so we advertise support.

vmx – This flag indicates Intel VT-x virtualization support (the name is a bit confusing; on PowerPC VMX meant something else, but in Gentoo’s CPU_FLAGS_X86 context, vmx means the CPU supports hardware virtualization extensions). On an Intel chip, vmx will be on if VT-x is enabled. Setting this doesn’t directly speed up normal software, but it can matter for packages like QEMU or virtualization libraries that can optionally use hardware acceleration. For example, QEMU with KVM will definitely use VT-x if available (and it is, on this CPU). Including vmx in CPU_FLAGS_X86 simply ensures that any package which checks for virtualization extensions knows they’re present.

To get this list, I used the tool cpuid2cpuflags (available in Gentoo). It reads /proc/cpuinfo and prints the recommended CPU_FLAGS_X86 for your CPU – much easier than guessing manually. The flags above came from that tool (cross-checked against what a similar 13th-gen CPU reports). Notably absent are AVX-512 flags: the 14700K, like other recent Intel consumer CPUs, does not have AVX-512 (it’s based on an Alder Lake design with AVX-512 fused off), so we don’t list avx512f/dq/bw/etc. If you were on an older Core X-series or Xeon that had AVX-512, you’d include those too.
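The workflow is two commands; the tool prints a ready-to-paste line of the form CPU_FLAGS_X86: aes avx avx2 …:

# Install the helper, then let it inspect the CPU
emerge --ask app-portage/cpuid2cpuflags
cpuid2cpuflags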

Why this matters

Having CPU_FLAGS_X86 properly set means we get the full benefit of our CPU’s capabilities when using pre-built libraries. For example, if you emerge media-libs/x264 (the H.264 video encoder library), the ebuild will look at CPU_FLAGS_X86 and enable assembly optimizations accordingly – using SSE2, AVX, etc., to make encoding much faster. If we left this blank or incomplete, some packages might fall back to generic C code despite the CPU being capable of more. It’s an easy win for performance.

In summary: CFLAGS tune the compiler output, while CPU_FLAGS_X86 inform hand-written code paths in packages. We set both for maximum acceleration.

MAKEOPTS – Parallel Compilation

MAKEOPTS controls the number of parallel jobs used when compiling a single package. Our i7-14700K has 20 cores (8 Performance cores + 12 Efficient cores) and, thanks to HyperThreading on P-cores, 28 threads total. We want to take advantage of all that concurrency to speed up build times.

I set: MAKEOPTS="-j28". This tells GNU Make to spawn up to 28 compile processes in parallel. In practice, that means when Portage is building a large package (say, GCC or PyTorch C++ core), it will try to compile 28 source files simultaneously. Ideally, you’d match the number of threads or cores. Some people use a formula of cores*1.5 or cores+1 to try and eke out a bit more throughput if some threads are waiting on I/O. But since this system is pretty fast and to avoid over-committing, I simply match the thread count.

Memory considerations: Each compiler job uses memory, and some large C++ sources or templates can consume a lot per process. It’s wise to ensure you have enough RAM for -j28. A common guideline is to allocate about 2GB RAM per job for safety. For example, if you had 16GB RAM, you might not want to run 28 heavy jobs in parallel – you could exhaust memory or start swapping (which kills performance). In my case, the machine has 64GB, so 28 jobs (each potentially using a few hundred MB, with occasional spikes) is fine. If you find certain huge packages (like web browsers or libreoffice) eat too much RAM with full parallelism, you can reduce jobs for those specific builds – we’ll see how in the package.env section. The Gentoo Wiki suggests setting jobs = min(CPU threads, RAM in GB / 2) as a rule of thumb. With 64GB, RAM/2 = 32, and CPU threads = 28, so 28 is within that limit.

It’s also possible to append a -l (load average) limit in MAKEOPTS, e.g. MAKEOPTS="-j28 -l28" to tell it to keep the system load around 28. I didn’t explicitly do it here, but it can be useful on multi-user systems or laptops to avoid overload. On this dedicated build workstation, I just let it rip at -j28.

Tip:

If you see diminishing returns or thrashing with -j equal to your thread count (for example, if using spinning HDDs for I/O, or limited RAM), you can experiment with slightly lower values. There’s no one-size-fits-all; monitor CPU and memory during big emerges to fine-tune. But in general, MAKEOPTS is where you massively cut compile times by using your hardware’s parallelism.

EMERGE_DEFAULT_OPTS – Parallel Emerging and More

While MAKEOPTS handles parallelism within a single package build, Portage can also build multiple packages in parallel (since a world update usually has many packages to merge, some of which don’t depend on others). EMERGE_DEFAULT_OPTS in make.conf lets us set default options for the emerge command.

I have: EMERGE_DEFAULT_OPTS="--jobs=2 --load-average=28". This means Portage can launch up to 2 package merges at once, as long as the system load is below 28. In practice, if I do emerge -uDN @world, Portage might compile two separate packages concurrently (each of those will itself use up to 28 MAKEOPTS jobs). This can speed up large updates where, say, you have one big package and several small ones – the small one can compile on some cores while the big one is also going. The --load-average=28 ensures that if the system is already fully busy (load 28 corresponds roughly to 28 active threads), Portage won’t spawn more to avoid overload.

Setting --jobs (also called emerge parallel jobs) requires some care: if you choose a high number, you could have too many heavy builds at once (for instance, 4 packages each trying to use 28 threads – that’s effectively 112 threads contending, which will overwhelm a 28-thread CPU!). I chose 2 as a modest level of parallelism. Often one compile job will not use the full CPU constantly (there’s I/O wait, etc.), so a second one can fill in gaps. But I avoid going higher than 2 or 3 on this single machine. Gentoo’s documentation points out that you should use --jobs in combination with --load-average to prevent overscheduling. The value of load-average should roughly equal your core count; here 28 is used as that benchmark.

Other useful things one might put in EMERGE_DEFAULT_OPTS include:

--keep-going (so that if a package fails in a big update, Portage continues with others and reports the failure at end).

--with-bdeps=y (to ensure build-time dependencies are considered for world updates, usually default in new Portage).

--quiet-build=y (to reduce output noise; I personally like seeing the build logs, but some prefer quieter emerges).

In our example, I kept it focused on parallelization settings. These defaults save me from typing them every time I run emerge - particularly helpful when doing large updates or rebuilds.
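For reference, a variant with those extras added would look like this – an illustration, not what I actually run:

EMERGE_DEFAULT_OPTS="--jobs=2 --load-average=28 --keep-going --with-bdeps=y --quiet-build=y"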

Portage FEATURES and Tools

The FEATURES variable enables various Portage features to improve the build process. Here, I set FEATURES="ccache parallel-fetch".

ccache – This turns on caching of compilation results. If you rebuild the same package with the same code and flags, ccache can retrieve the compiled output from cache instead of recompiling. This can be a huge time saver if you often recompile certain packages (for example, during iterative development of a package, or if switching between USE flags on a big codebase frequently). I enable ccache to speed up repetitive builds; it’s generally safe and only costs some disk space for the cache. Note that first-time builds aren’t faster (they will populate the cache), but subsequent builds can be much quicker. For large C++ projects like TensorFlow or PyTorch (if you compile them more than once), ccache is a blessing. We do need to install dev-util/ccache and configure it (e.g., set the cache size) separately, but Portage will automatically use it with this flag on.
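Setting it up is quick. One way to do it (the cache location and size are a matter of taste):

emerge --ask dev-util/ccache
# In make.conf, point Portage at a dedicated cache:
#   CCACHE_DIR="/var/cache/ccache"
# Then cap the cache size:
CCACHE_DIR=/var/cache/ccache ccache --max-size=10G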

parallel-fetch – This allows Portage to download source distfiles in parallel with building. So while one package is compiling, Portage can already start fetching the tarballs for the next ones. This overlaps network I/O with CPU work and speeds up overall emerge times when multiple packages need fetching. In modern Portage, fetching in parallel might be on by default, but I include it to be sure. It basically spawns fetch jobs early (you might see output like “Fetching xyz.tar.xz” while another compile is running). This is especially useful after a long time between updates, where many source files need to be downloaded – you don’t want the CPU idle waiting for downloads serially.

There are many other FEATURES one could enable depending on needs:

distcc for distributed compiling across multiple machines (not used here, as I don’t have a distcc setup – and for a single very powerful machine, distcc might not help much unless you have comparable servers around).

sandbox, user-sandbox (these are on by default to protect the system during builds).

preserve-libs (also default, helps recover from library updates without breaking revdeps immediately).

splitdebug (to split debug symbols into separate files, used with FEATURES="splitdebug" and appropriate CFLAGS, if you want to keep debug info for profiling or debugging without bloating binaries at runtime).

network-sandbox (enabled by default nowadays; it prevents build processes from downloading anything on their own, to keep builds deterministic).

…and more – man 5 make.conf documents the full FEATURES list.

I’ve kept it simple: ccache and parallel-fetch are directly beneficial for a development/ML workload machine by reducing build times.

VIDEO_CARDS – This is actually a USE_EXPAND setting, not exactly FEATURES, but I included it in this section of make.conf. It informs Xorg driver and some packages what graphics drivers to build/support. I set VIDEO_CARDS="nvidia" to indicate I use an NVIDIA GPU (with the proprietary driver). This way, x11-base/xorg-drivers will pull in the appropriate x11-drivers/nvidia-drivers package. It also ensures any package that has optional support for NVIDIA (CUDA-related or OpenGL libraries) is aware of the target. If I had an AMD GPU, I’d set VIDEO_CARDS="amdgpu radeonsi", etc. For CUDA work, obviously NVIDIA is the choice, so this is just making sure the system knows to install NVIDIA drivers and related bits. (We must also have the NVIDIA CUDA Toolkit for GPU computing – Gentoo’s dev-util/nvidia-cuda-toolkit gets pulled in as needed by USE=cuda.)

GRUB_PLATFORMS – This is a bit niche: I include GRUB_PLATFORMS="efi-64 xen". It’s relevant if you use GRUB as your bootloader on a Xen host. The efi-64 part is for UEFI boot on 64-bit systems (standard for modern hardware). Adding xen builds GRUB for the Xen platform as well – the Gentoo Xen guide recommends enabling it in a Xen setup (it produces GRUB images that can run as Xen PV guests, pvgrub-style). After setting this, one would re-emerge sys-boot/grub so the additional platform gets built. It’s a small thing, but worth getting right in a Xen host setup.

Kernel and Xen-Specific Flags

While most of make.conf deals with userland, there are a couple of settings relevant to the kernel and virtualization:

In the Xen context, when building the Xen hypervisor itself (app-emulation/xen and xen-tools packages), there are some special requirements. Notably, to include UEFI support in Xen, the build system requires a particular linker flag. Gentoo’s documentation says “Adds UEFI boot support, requires LDFLAG -melf_x86_64 for amd64” for the Xen package. In practice, this means we need to compile Xen with an extra linker flag to target the EFI binary format. Rather than put -melf_x86_64 globally in LDFLAGS (which could affect other builds), a smarter approach is to use a package-specific environment override. For example, I can create a file /etc/portage/env/app-emulation/xen-ldflags.conf with LDFLAGS="${LDFLAGS} -melf_x86_64", and then in /etc/portage/package.env assign the xen package to use that environment. This way, only Xen’s build gets that flag. This is a prime example of using package.env for a unique requirement. (If this sounds complex, the gist is: Xen needs a special link flag for UEFI; we supply it to avoid build errors or missing UEFI support).
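Concretely, the override is two small files (package.env entries resolve relative to /etc/portage/env):

# /etc/portage/env/app-emulation/xen-ldflags.conf
LDFLAGS="${LDFLAGS} -melf_x86_64"

# /etc/portage/package.env
app-emulation/xen app-emulation/xen-ldflags.conf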

QEMU targets – Since our system uses Xen, we also have QEMU in the mix (Xen can use QEMU for device emulation for HVM guests). Gentoo’s QEMU package can build support for many target architectures (arm, ppc, mips, etc.), but if you only plan to use x86 VMs, you can restrict this to save time and space. I set:

QEMU_SOFTMMU_TARGETS="x86_64"
QEMU_USER_TARGETS="x86_64"

This tells the QEMU ebuild to only build x86-64 emulation (and for user-space emulation, only x86-64). I don’t need QEMU to emulate ARM or others on this host. By default, if these are not set, QEMU might build a plethora of targets or none (depending on version – newer QEMU ebuilds might require you to opt-in to what you need). The Gentoo Wiki confirms that leaving these unset will disable most targets, and you enable the ones you want in make.conf. So we explicitly enable just our architecture. This significantly cuts QEMU’s compile time. If I needed 32-bit x86 support, I could add i386 to softmmu targets as well. For Xen’s usage (device model), x86_64 is sufficient. Additionally, I have USE flags like qemu and hvm enabled on xen-tools by default (Gentoo’s Xen meta-packages do that for us), which ensures the QEMU device model is included for HVM guests.

Virtualization tuning – The kernel for Xen dom0 is custom-built. There aren’t direct make.conf flags for kernel tuning (since the kernel isn’t built via Portage in my setup, I configure and compile it manually or with genkernel). However, it’s worth mentioning a couple of kernel config choices for this scenario:

The kernel should have Xen pvops support (enabled by selecting “Xen guest support” and “Xen backend driver support” for dom0). This is outside make.conf, but crucial.

If using GCC to compile the kernel, one could experiment with setting KCFLAGS for the kernel build to include things like -march=raptorlake as well. The kernel ebuilds use a safe optimization level and march by default. In my case, I compile the kernel manually and pass -O2 -march=native for consistency (a sketch follows this list). It hasn’t caused issues – but be careful, as certain kernel code doesn’t like aggressive flags. Generally stick to -O2 for the kernel unless you know what you’re doing.

The Xen hypervisor itself is a small specialized kernel – it doesn’t benefit from crazy optimizations, so I compile it with defaults (plus the required flags for UEFI). It’s more important that Xen is stable than super optimized.
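For reference, the manual kernel build with KCFLAGS looks roughly like this – a sketch of my workflow, not a Portage-managed build:

cd /usr/src/linux
make -j28 KCFLAGS="-O2 -march=native"
make modules_install
make install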

In summary, the make.conf is augmented with a few virtualization-specific tweaks: enabling Xen support in all relevant places (USE flags, GRUB, QEMU targets) and handling Xen’s special build needs. This ensures that running Xen as a hypervisor on this Gentoo system is smooth and fully supported.

AI/ML Workload Considerations

Finally, let’s discuss how this configuration benefits AI/ML development specifically, and note any other recommendations for such workloads:

Math Libraries and Python Science Stack: We already covered enabling BLAS/LAPACK (OpenBLAS in our case). For machine learning, libraries like NumPy, SciPy, and Pandas will automatically use the chosen BLAS/LAPACK for operations (Gentoo’s numpy can be built against OpenBLAS vs MKL via USE flags, and SciPy just links to the virtual/blas provider). It’s a good idea to also install oneDNN (the oneAPI Deep Neural Network Library, formerly MKL-DNN), which frameworks like TensorFlow can use for optimized primitives; Gentoo packages it as sci-ml/oneDNN. It can be pulled in via USE flags (e.g., the onednn USE on sci-libs/caffe2, the PyTorch package). In make.conf we set nothing special for this beyond ensuring the global USE isn’t blocking it; I left onednn at the package default (which for PyTorch is on, as the +onednn in the package listing indicates).

Deep Learning Frameworks (TensorFlow/PyTorch): Gentoo does package PyTorch (as sci-libs/caffe2) and TensorFlow (sci-libs/tensorflow, available via overlays or user ebuilds). Compiling these from source is very resource-intensive. However, if you choose to do so, our configuration prepares us well:

We have cuda globally, so PyTorch will build with CUDA support (this brings in a bunch of dependencies like CuDNN, NCCL, etc., if available). Gentoo’s PyTorch ebuild has a cuda global flag and also local USEs for cudnn, fbgemm, tensorrt (if you have those). Make sure to accept the licenses for NVIDIA stuff if needed.

We enabled openblas and openmp for PyTorch, which means it will prefer OpenBLAS for its CPU tensor operations and use OpenMP for multithreading (alternatively, one could use MKL by USE="mkl", which would route PyTorch to MKL BLAS; we went with OpenBLAS as discussed).

The CPU flags ensure that if PyTorch or TensorFlow use JIT code generation or vectorized ops (XLA, etc.), they’ll know AVX2 and FMA are available. For instance, TensorFlow will attempt to use AVX instructions in its binaries if the host supports it.

One might consider adding -DTENSORFLOW_USE_MKL=1 or similar if building TensorFlow from source with MKL; with OpenBLAS, having it as the default BLAS provider is enough. Honestly, for TensorFlow I often use the official pre-built wheels (building TF can take many hours even on this beefy machine). But if you do build it, make sure you have enough swap/RAM, and consider limiting MAKEOPTS for it – TensorFlow’s Bazel-based build can spawn a lot of processes, and capping it at, say, 16 jobs via package.env can prevent OOM.
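Pulling the per-package flag choices together, a package.use entry for the PyTorch core might look like this – a sketch; flag availability varies across ebuild versions:

# /etc/portage/package.use/pytorch
sci-libs/caffe2 cuda cudnn openblas openmp -mkl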

CUDA Toolkit and GPU drivers: Because of USE=cuda and VIDEO_CARDS=nvidia, the necessary pieces are in place. The nvidia-cuda-toolkit provides nvcc (the CUDA compiler) and GPU-accelerated libraries, and nvidia-drivers provides the kernel module and OpenGL libraries. If you plan on doing development with CUDA, having CUDA_PATH set in your environment and perhaps adding CUDA’s bin and lib paths in /etc/env.d is useful, but not strictly make.conf material. Also, enable the cudnn USE flag for any libraries that can use NVIDIA’s cuDNN (NVIDIA’s optimized GPU neural network library), and install sci-libs/cudnn (which is a binary blob download, but Gentoo has an ebuild for it). For example, USE="cudnn" on PyTorch will make it use cuDNN for convolution operations.

Parallelism in ML: We already ensure OpenMP is on, which many libs will use for CPU threading. Another consideration is whether to use MPI for multi-node training. If this machine is part of a cluster, enabling mpi USE globally might be desirable, so that packages like OpenBLAS or PyTorch build MPI support (Gentoo’s PyTorch ebuild has mpi USE to enable distributed training). In our make.conf, mpi isn’t enabled because this is a single-node setup. But one could easily add it.

Profiling and Debugging: For an advanced user, sometimes you want to profile ML code or debug numerical issues. You might consider enabling symbols for certain libraries. Rather than doing it globally (which would make everything huge), you can use package.env to enable debug symbols on specific packages. For instance, if I want to debug media-gfx/gimp (as in the wiki example) or say dev-lang/python for a problem, I could create an env like debug.conf with CFLAGS+=" -g" and FEATURES="splitdebug" and apply it to that package in package.env. This way, I get a debug-enabled build of just that component. The example in the make.conf above doesn’t have this, but I mention it as a technique relevant to development.

Storage and TMPDIR: Machine learning frameworks can produce very large temporary files (for example, during compilation of TensorFlow, or when linking big libraries). As a tip, if you have lots of RAM and an NVMe SSD, consider using a tmpfs for Portage’s build directory (PORTAGE_TMPDIR). In my config above, I did not change PORTAGE_TMPDIR (it defaults to /var/tmp/portage). But you could mount /var/tmp/portage as a tmpfs to speed up compilation I/O. The wiki notes that building in RAM can accelerate builds significantly. Just ensure you have enough memory + swap, because large packages can require multiple tens of GB of space to compile. Alternatively, what I do is mount /var/tmp/portage on a fast NVMe drive and use package.env to put extremely large packages on disk instead of tmpfs if needed (the wiki example shows how to override PORTAGE_TMPDIR for certain big packages that might not fit in tmpfs). This is an advanced trick for optimizing build times.
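If you do go the tmpfs route, the classic fstab entry looks like this (size it to your RAM budget; on a 64GB box something like this leaves headroom):

# /etc/fstab
tmpfs   /var/tmp/portage   tmpfs   size=48G,uid=portage,gid=portage,mode=775,noatime   0 0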

In essence, the make.conf we crafted is already oriented toward heavy computation and development. The combination of optimized compilation flags, tuned USE flags for math libs and GPU, and portage parallelization means the system is primed to install and run machine learning workloads efficiently.

Package.env Overrides – Fine Tuning per Package

Throughout this discussion, I mentioned per-package overrides a few times. This is a powerful mechanism Gentoo provides: you can create custom environment settings for specific packages in /etc/portage/package.env. Since our topic is make.conf, I won’t delve too deep, but I want to illustrate a couple of cases where you might use it in conjunction with our make.conf settings:

Big Packages with High RAM Usage: We set MAKEOPTS for aggressive parallelism. If you encounter a package that fails to build or causes the machine to thrash due to memory (e.g., www-client/firefox or dev-qt/qtwebengine are notorious), you can reduce jobs for just that package. For example, as the Gentoo Wiki suggests, create a file named makeopts-jobs-16.conf with MAKEOPTS="-j16" and then in package.env add a line like www-client/firefox makeopts-jobs-16.conf. This ensures Firefox only uses 16 threads while the rest of the system still uses 28. In our scenario, building sci-libs/tensorflow might need such a cap if 28 jobs is too much for Bazel – the same approach applies.
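Spelled out, that’s two small files:

# /etc/portage/env/makeopts-jobs-16.conf
MAKEOPTS="-j16"

# /etc/portage/package.env
www-client/firefox makeopts-jobs-16.conf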

Mixing O2/O3 per package: If you wanted to build, say, OpenBLAS or BLIS with -O3 to eke out every drop of performance, you could do that without risking the whole system. For example, echo 'sci-libs/openblas custom-cflags.conf' >> /etc/portage/package.env, where custom-cflags.conf contains CFLAGS="${CFLAGS} -O3" (or even more specific flags). Gentoo devs try to ensure such libs work fine with -O3, but you would be assuming the risk. Still, it’s nice to have the option. Gentoo forum threads are full of users discussing this kind of micro-optimization, but it’s truly for the tweakers. In general, our -O2 global and good CPU flags will do the job.

Different BLAS for certain packages: Suppose you generally use OpenBLAS, but you have one particular scientific app that you want to compare MKL with, you could install MKL and set that app to use MKL by enabling mkl USE just for it in package.use and perhaps altering environment variables. This isn’t exactly package.env, but a similar concept of per-package customization. Gentoo’s flexibility allows such mixes, though it can get complex.

Our example make.conf doesn’t explicitly use package.env, but it’s the safety valve that lets us push global optimizations (like -march=native, -j28, etc.) knowing we can dial things back for specific problematic packages if needed. It’s a feature for advanced users that complements an aggressive make.conf nicely.

Conclusion

The configuration presented – from CFLAGS to USE flags to Portage options – is tailored for a high-performance Gentoo Linux system doing AI/ML development on bare metal with a Xen hypervisor. Each line in the make.conf has been chosen with a rationale:

Performance: Use all available CPU features (e.g. AVX2, FMA) and parallelize builds to compile and run code faster.

Relevance: Enable support for technologies actually used (CUDA, OpenMP, Xen, OpenBLAS) and disable stuff that isn’t (unneeded GUIs) to avoid bloat.

Stability: Stay within proven safe optimizations globally (e.g. -O2) and use Gentoo’s mechanisms (feature flags, package.env) to work around any corner cases.

Maintainability: Many choices align with Gentoo defaults or documented best practices, making the system easier to upgrade. We’re not doing anything too crazy like global LTO or -Ofast which would put us in undocumented territory – instead it’s incremental improvements building on Gentoo’s sane defaults.

By understanding and explaining each setting, I hope this demystifies the make.conf for intermediate users and provides new ideas for advanced users. Tuning Gentoo is an ongoing, iterative process – you might adjust things as hardware or workload changes. But with the above as a starting point, you have a robust foundation for a Gentoo install that’s optimized for modern CPUs, ready for machine learning workloads, and aware of its role as a Xen hypervisor host. Happy compiling, and may your emerge times be ever short!

Sources:

Gentoo Wiki – GCC Optimization (recommendations on -O2 vs -O3, -march, etc.)

Gentoo Wiki – make.conf (documentation of variables like MAKEOPTS, EMERGE_DEFAULT_OPTS)

Gentoo Wiki – BLAS and LAPACK Providers (overview of BLAS implementations on Gentoo)

Gentoo Wiki – Xen Guide (USE flags and build requirements for Xen hypervisor on Gentoo)

Gentoo Wiki – QEMU (using QEMU_TARGETS to limit built targets)

Gentoo Wiki – Package.env Usage (examples of per-package MAKEOPTS and debug flags)

GCC 13 Release Notes – (confirmation of -march=raptorlake support in GCC for Raptor Lake CPUs)

Intel Developer Article – (notes on GCC 13 adding Raptor Lake and AVX-VNNI support)

Hacker News discussion – (performance of OpenBLAS vs MKL on Intel, and BLAS in ML like libxsmm for small matrices)

Gentoo Packages Database – (USE flags for PyTorch/Caffe2, showing cuda, openblas, mkl, etc.)