2012-06-14

architectures that prevent freezing #mac

5.9: sci.cyb/mac/architectures that prevent freezing:
to: cocoa-dev@lists.apple.com
. in a pre-emptive OS there should be no freezing;
given the new concurrency model
that includes the use of the graphics processor (GPU)
to do the system's non-graphics processing,
my current guess is that the freezes happen when
something goes wrong in the GPU,
and the CPU is just waiting forever .
. the CPU needs to have some way of getting control back,
and sending an exception message to
any of the processes that were affected by the hung-up GPU .
. could any of Apple's developers
correct this theory or comment on it ?
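
. the proposal above, where the CPU gets control back instead of waiting forever and then reports the failure to every affected process, can be sketched as a bounded wait. This is only a user-space Python analogy under stated assumptions, not how Mac OS X or any GPU driver actually works: the "GPU job" is a thread that may never signal completion, and `wait_for_gpu`, `GpuTimeout` and the client names are all hypothetical.

```python
import threading


class GpuTimeout(Exception):
    """Hypothetical error delivered to clients whose GPU work never finished."""


def wait_for_gpu(done: threading.Event, clients: list[str], timeout_s: float) -> dict[str, str]:
    """Wait for a (simulated) GPU job, but never longer than timeout_s.

    Returns a per-client status: "ok" if the job completed in time,
    otherwise the text of the GpuTimeout reported to that client.
    """
    if done.wait(timeout=timeout_s):  # True only if the event was set in time
        return {c: "ok" for c in clients}
    # The CPU got control back: report the failure instead of blocking forever.
    err = GpuTimeout(f"GPU did not respond within {timeout_s} s")
    return {c: f"error: {err}" for c in clients}


def simulated_gpu_job(done: threading.Event, hang: bool) -> None:
    """Stand-in for GPU work; when hang is True it never signals completion."""
    if not hang:
        done.set()


if __name__ == "__main__":
    for hang in (False, True):
        done = threading.Event()
        threading.Thread(target=simulated_gpu_job, args=(done, hang), daemon=True).start()
        print("hang" if hang else "normal", wait_for_gpu(done, ["Finder", "Safari"], timeout_s=0.2))
```

. the design choice that matters is `threading.Event.wait(timeout)`, which returns `False` when the timeout expires, so the waiting side always regains control and can decide what to tell its clients.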

5.11: co.cocoa-dev:
[replies are nicely formatted at cocoabuilder.com]

Jens Alfke @mooseyard.com May 11, 2012 at 9:05 AM
> . in a pre-emptive OS there should be no freezing;
> given the new concurrency model
> that includes the use of the graphics processor GPU
> to do the system's non-graphics processing,

Well, the GPU can _occasionally_ be used to
do some non-graphics work,
typically tasks that are highly parallelizable.
I’d reckon this happens most often in games,
less in general purpose software.

> my current guess is that the freezes happen when
> something goes wrong in the GPU,
> and the CPU is just waiting forever .
> . the CPU needs to have some way of getting control back,
> and sending an exception message to
> any of the processes that were affected by the hung-up GPU .
> . could any of Apple's developers
> correct this theory or comment on it ?

OS freezes tend to happen when kernel-level code
gets into an infinite loop or deadlock.
Sure, there “should be no freezing”,
but there should be no bugs either, and that’s never true.
(It’s exacerbated by the fact that
some third-party device drivers
need to run in kernel space.)

_Some_ system freezes are due to
the GPU completely locking up,
usually due to a bug in the GPU vendor’s driver.
My understanding is that when this happens
it’s not really possible for the GPU to recover
without a system reset.
The CPU is probably still OK,
but that doesn’t do any good if it
can’t access the display.
[. but I'm losing the keyboard too (... I think) .]
Jean-Daniel Dupas @shadowlab.org 9:23 AM
While playing with GPU programming,
I had a lot of such freezes,
and they never locked the CPU.
I was always able to connect to my machine through SSH.
Killing the processes that are affected is not enough.
You may have to reinitialize the GPU driver.
I think this is something Windows is able to do,
but not Mac OS X AFAIK.
More info about how it works on Windows
can be found here:
http://msdn.microsoft.com/...hardware/...
Jens Alfke 10:55 AM
On May 11, 2012, at 9:23 AM, Jean-Daniel Dupas wrote:
> I was always able to connect to my machine though SSH.

So a regular user process can
permanently lock up the display, requiring a reboot,
just by executing some bad GPU code?!
That’s kind of a bad privilege violation
and could be considered a DoS exploit.
Jean-Daniel Dupas 12:45 PM
It was on 10.6, and I never tested whether the system was
better at handling a lack of VRAM on 10.7
(caused, for example, by a texture leak).
That said, if you want a DoS exploit,
just send a shutdown Apple Event to the system,
and it will stop the machine without asking anything.

osascript -e 'tell app "System Events" to shut down'
-- Jean-Daniel
Charles Srstka @charlessoft.com 1:40 PM
On May 11, 2012, at 12:55 PM, Jens Alfke wrote:
> So a regular user process can permanently lock up the display,
> requiring a reboot, just by executing some bad GPU code?!

On the old 2008 non-unibody MacBook Pro
that I used to have
with the NVidia 8600M GT in it,
you could do that just by executing some *good* GPU code.
Wade Tregaskis  2:49 PM [ref]
>> While playing with GPU programming,
>> I had a lot of such freeze, and they never locked the CPU.
>> I was always able to connect to my machine though SSH.

Sometimes you can, sometimes you can't.
It depends on exactly how things fail.

> So a regular user process can permanently
> lock up the display, requiring a reboot,
> just by executing some bad GPU code?!
> That’s kind of a bad privilege violation
> and could be considered a DoS exploit.

Yes, it is.  A particularly serious one.
It's been a pet peeve of mine for years
that this is knowingly ignored
by those who should know better.

Last I checked, there was a
watchdog mechanism on the GPU
that would fire after some time
(7 seconds on nVidia GPUs).
That signals the driver (on the CPU)
that it's doing a watchdog reset.
Unfortunately, the drivers don't really handle that.
They could - and *should* -
reinitialise and recover,
but it's just not implemented
(or at least wasn't, a year or two ago).
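
. the watchdog behaviour described above, including the recovery step the drivers reportedly skip, can be modelled as a small state machine. This is a toy sketch under assumptions: the class, method and context names are hypothetical, times are passed in explicitly so the model is deterministic, and the 7-second default only echoes the figure quoted for nVidia GPUs.

```python
from dataclasses import dataclass, field


@dataclass
class GpuWatchdog:
    """Toy model of a GPU watchdog plus the driver-side recovery path.

    All names are hypothetical; times are plain numbers (seconds).
    """
    timeout_s: float = 7.0  # figure quoted in the thread for nVidia GPUs
    last_progress: float = 0.0
    contexts: set[str] = field(default_factory=set)
    resets: int = 0
    log: list[str] = field(default_factory=list)

    def heartbeat(self, now: float) -> None:
        """Record that the GPU completed some work (made progress)."""
        self.last_progress = now

    def check(self, now: float) -> bool:
        """Fire the watchdog if the GPU has been silent too long.

        Returns True if a reset-and-recover cycle was performed.
        """
        if now - self.last_progress < self.timeout_s:
            return False
        self.resets += 1
        # Recovery the driver *should* do rather than leaving clients hung:
        # 1. tell every client its GPU context was lost (an error, not a hang)
        for ctx in sorted(self.contexts):
            self.log.append(f"context lost: {ctx}")
        self.contexts.clear()
        # 2. reinitialise the device and restart the watchdog timer
        self.log.append("device reinitialised")
        self.last_progress = now
        return True


if __name__ == "__main__":
    wd = GpuWatchdog(contexts={"game", "cuda_demo"})
    wd.heartbeat(1.0)
    print(wd.check(5.0))  # False: only 4 s without progress
    print(wd.check(8.5))  # True: 7.5 s without progress fires a reset
    print(wd.log)
```

. the point of the sketch is step 1: clients get an error they can handle, and only the device is restarted, instead of the whole machine needing a reboot.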

[[ There are probably also timeouts implemented
in the driver and various other layers,
though I don't know the details.]]

If you've ever done any CUDA work
you'll be all too familiar with this problem.
Much of nVidia's own example code
will trigger this failure mode,
and most require a reboot to recover from.

In general, GPUs are in a comparative stone age
when it comes to security and stability.
They're getting better
- retracing the steps of CPUs thirty years ago
while thinking they're being very clever,
in a cutely naive way -
but it'll probably be many years
before these problems are properly resolved.

AMD & Intel are significantly ahead of nVidia
in this regard, I hear.
But personally, after having every single
nVidia-packing machine I've ever owned
die (sometimes repeatedly) due to
GPU-related hardware faults,
I'd never buy an nVidia based machine to begin with.
But I digress...  back to trying to
recover data from my 8800GS iMac...  again...

5.9: web: preemption is relative:

. a user@macosx.com is asking
why the mac doesn't seem in control
if it really is a preemptive system;
and Wikipedia's preemption page was helpful:

preemption is relative to {user, supervisor} mode:
. In any given system design,
some operations performed by the system
may not be preemptible.
This usually applies to kernel functions
and service interrupts which,
if not permitted to run to completion,
would tend to produce race conditions
resulting in deadlock.
Barring the scheduler from preempting tasks
while they are processing kernel functions
simplifies the kernel design
at the expense of system responsiveness.
The distinction between user mode and kernel mode,
which determines privilege level within the system,
may also be used to distinguish whether
a task is currently preemptible.
Some modern systems have preemptive kernels,
designed to permit tasks to be preempted
even when in kernel mode.
Examples of such systems are Solaris 2.0/SunOS 5.0[1],
Windows NT, the Linux kernel 2.6 and 3.x, AIX
and some BSD systems (NetBSD, since version 5).
Other systems improve responsiveness
by a microkernel design.
This moves most of the system logic out of the kernel
and into user mode processes, which are preemptible.
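
. the trade-off in the quoted passage, where non-preemptible kernel functions simplify the kernel but cost responsiveness, can be made concrete with a toy latency model. It is an assumption-laden sketch, not any real scheduler: one low-priority task spends a single window inside a kernel function, a high-priority task becomes runnable at `arrival`, and context-switch cost is ignored. `dispatch_latency` is a hypothetical name.

```python
def dispatch_latency(arrival: float, section_start: float, section_len: float,
                     preemptible_kernel: bool) -> float:
    """Delay before a newly runnable high-priority task gets the CPU.

    Toy model: a low-priority task is inside a kernel function during
    [section_start, section_start + section_len); outside that window it
    runs in user mode, which is always preemptible. Switch cost is ignored.
    """
    section_end = section_start + section_len
    in_kernel = section_start <= arrival < section_end
    if in_kernel and not preemptible_kernel:
        # The scheduler must wait for the kernel function to finish.
        return section_end - arrival
    return 0.0


if __name__ == "__main__":
    # Low-priority task is in a kernel function from t=1.0 to t=6.0.
    print(dispatch_latency(2.0, 1.0, 5.0, preemptible_kernel=False))  # 4.0
    print(dispatch_latency(2.0, 1.0, 5.0, preemptible_kernel=True))   # 0.0
    print(dispatch_latency(7.0, 1.0, 5.0, preemptible_kernel=False))  # 0.0, back in user mode
```

. with a non-preemptible kernel the worst case approaches the full length of the kernel section (arrival just as it starts); a preemptible kernel brings that down to the context-switch cost, which this model treats as zero.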

how linux differs from unix:
(some interesting discussion here about preemption).

5.8: sci.cyb/mac/architectures that prevent freezing:
to: discussions.apple.com
. in a pre-emptive OS there should be no freezing;
given the new concurrency model
that includes the use of the graphics processor (GPU)
to do the system's non-graphics processing,
my current guess is that the freezes happen when
something goes wrong in the GPU,
and the CPU is just waiting forever .
. the CPU needs to have some way of getting control back,
and sending an exception message to
any of the processes that were affected by the hung-up GPU .
. could anybody at discussions.apple.com
correct this theory or comment on it ?

[... many replies trying to be helpful .]