As of Linux 6.7 I’m getting hard freezes that require a power cut to reset (SysRq doesn’t work). They happen both at idle and under load, anywhere from 5 minutes to an hour in. Running journalctl --follow and dmesg -w (both as root) reveals nothing at the time of the freeze. Kernel version 6.6 continues to be 100% stable.

System:

  • Distro/Kernel: Arch Linux 6.7.arch3-1
  • CPU: AMD Ryzen 5 2600X
  • GPU: AMD RX580 8GB via AMDGPU
  • RAM: Some configuration of 16GB at 2667 MT/s.
  • WM: SwayWM

I’m unsure how to go about properly reporting a bug if no errors are being generated.

Any advice?

I’m not alone on this, apparently (warning, it’s Reddit).

  • mvirts@lemmy.world
    8 months ago

    I’ve never done it, but I would try reproducing this in a VM like qemu… I would be googling at this point but I think you can debug a kernel crash from there somehow.

      • Corngood@lemmy.ml
        8 months ago

        I did this recently and it was extremely quick to bisect and debug, but I was lucky enough to have a simple repro that worked in the emulator.

        I think if I were you I’d try to repro on bleeding edge first. Then if it’s still broken, I’d try to get the repro time down as much as possible and automate it. Then I’d either bisect on qemu if possible, or bare metal.
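One way to automate the loop described above is `git bisect run`, which decides whether a commit is good or bad from the exit status of a test script: 0 means good, 125 means "skip this commit", and any other code from 1 to 127 means bad. The sketch below only shows that mapping. The function name and its two inputs are hypothetical, and detecting the freeze itself (for example, a host-side timeout on a QEMU guest) is assumed to happen elsewhere.

```python
# Exit codes that `git bisect run` understands.
GOOD = 0    # commit behaves correctly
BAD = 1     # any code 1-127 except 125 marks the commit bad
SKIP = 125  # commit cannot be tested (e.g. it does not build)


def bisect_exit_code(build_ok: bool, froze: bool) -> int:
    """Map the outcome of one test round to a `git bisect run` exit code.

    build_ok and froze are hypothetical inputs: how you detect a failed
    build or a freeze depends on your setup and is not shown here.
    """
    if not build_ok:
        return SKIP  # an unbuildable commit says nothing about the bug
    return BAD if froze else GOOD
```

A wrapper script would call `sys.exit(bisect_exit_code(...))` after each build-and-test round. On bare metal a hard freeze takes the script down with the machine, so this kind of automation is mostly useful if the freeze can be reproduced in an emulator.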

        • 0x0@social.rocketsfall.net (OP)
          8 months ago

          Yeah, the qemu idea was brought up earlier in the thread and it’s very interesting. Glad you confirmed you could repro real issues in the test environment, so it’s at least somewhat likely I’ll be able to do the same. Makes sense that it would work, and it’s way better than letting the real system crash and burn. My kernel compile time is pretty short so it shouldn’t be too bad to bisect; I’m just not sure how many commits separate my stable kernel from the bugged 6.7. TBH I’m not that familiar with kernel dev, so maybe it’s way simpler than that.

          • Corngood@lemmy.ml
            8 months ago

            The one I was able to test on qemu was a reliable failure of memory management syscalls triggered by a certain usage pattern. Unfortunately yours sounds like it’s probably hardware dependent. People in that Reddit thread mentioned video decoding, so you could try hammering that.

            The nice thing about bisecting is that it’s mostly logarithmic, so doubling the number of commits should only add one extra step. I’d be surprised if you had to do more than 10-12 steps.
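The logarithmic behaviour mentioned above can be checked with a few lines of arithmetic. This is a rough sketch that treats the history as a straight line of commits; real kernel history has merges, so git's own estimate may differ slightly. The commit counts are made-up round numbers, not the actual distance between 6.6 and 6.7.

```python
def bisect_steps(candidate_commits: int) -> int:
    """Approximate worst-case number of kernels to build and test.

    Each step halves the set of candidates, so the answer is
    ceil(log2(n)), computed here with integer arithmetic.
    """
    if candidate_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (candidate_commits - 1).bit_length()


# Hypothetical commit counts, chosen only to show the doubling effect.
for n in (1_000, 2_000, 4_000):
    print(n, "commits ->", bisect_steps(n), "steps")
```

Each doubling of the range adds exactly one step in this model, which is why a larger commit range is less painful than it first sounds.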

            You may already have a good kernel config, but for this sort of thing I usually use make localmodconfig. That’ll build only the modules that are loaded when you run it, which can cut down on compile time massively.

            • 0x0@social.rocketsfall.net (OP)
              8 months ago

              I’m fresh off ruling out the RAM via memtest. I’ll let it do a longer soak overnight to see if anything fails then, but I’m now on to bisecting the kernel from what I believe is the last release of 6.6 (6.6.13) to hopefully whatever the offending commit is. Been a while since I’ve had to mess around with manually building the kernel without the aid of linux-tkg, but I’m off to learn it anyway. Thanks for the help!

              • Corngood@lemmy.ml
                8 months ago

                Good luck! Sounds like you got it under control, but I’m happy to help if you run into trouble. I’m curious what you’ll find.