i'm happy to help anyone stuck on this confusing mess I've written - I do actually like doing it. unfortunately I'm generally not happy to read replies to stuff I post. Don't ask why - but do ask me for help. IDK how it works on github, but if I see a subject line like "can you help me with x" rather than "A reply to your issue was posted by -" I'm less likely to start hyperventilating and I might even read it.
prelude
step -1: follow the tutorial in README.md instead - that's the front page of the git repo (as with any github repo). notice that this doesn't let you edit the repo. you'll have to keep your own copy on hand.
step 0: do not use windows. do not use WSL... mac would be fine if it supported CUDA! WSL (before the linux inside it) takes longer to install than linux on a free drive does, and WSL makes linux slow. stable-fast is... self-describing.
how to read this:
type the stuff after $ into `bash`. Bash is the default shell pretty much everywhere. if your prompt ends in # you're running as root and -you must not continue-. --you will never need sudo.--
not all of these should be entered verbatim. This is monkey-see-monkey-do, not monkey-copy-monkey-paste. if there's a comment (##), pay attention. brackets, long weird directory paths etc. are also a sign that you need to change something.
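a quick way to check which one you are, if the prompt is ambiguous (plain coreutils, nothing exotic):
$ id -u ## prints your numeric user id. 0 means root: close the terminal and start over as a normal user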
activate ($ source) the relevant venv variables (i.e. if you're using sfast with sd.next, source from automatic/venv). then install the setup tools, then git clone (download) stable-fast's source code.
## something like: "source [automatic]/[a]v[irtual]env[ironment]/bin/activate" such as:
automatic]$ source venv/bin/activate
$ pip install -U pip wheel ninja setuptools
$ mkdir repositories ## make a 'folder' to isolate the names of this git repo from python (otherwise you'll confuse `import`)
$ cd repositories/ # or whatever directory you made - or existing one you found
$ git clone --recursive https://github.com/chengzeyi/stable-fast ## if recursive doesn't work, we try again next section
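before moving on it's worth checking that the venv is live and the submodules actually arrived. a sketch, assuming the automatic/venv layout from above:
$ which python ## should print something ending in venv/bin/python, not /usr/bin/python
$ git -C stable-fast submodule status ## every line should begin with a commit hash; a leading - means that submodule is missing (we retry next section)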
nano is a text editor I use to apply the patch I describe in the sfast issue you clicked to get here. it's ubiquitous and simple, but if you have any other (KDE Plasma has kate. We Like Kate. Kate is better than notepad++ on windows btw.) just enter that in place of nano. you can edit xformers_attention.py however you like and ignore this; nano is just guaranteed to work.
how to nano: ctrl-O to write Out (save) - you can change the filename and save a backup before editing, but git restore {file} will get it back too. ctrl-X to eXit. arrow keys, delete, backspace, return all behave in the obvious way. (if you don't have nano or any other choice, google how to get it. Arch: sudo pacman -Syu nano)
Or don't use the terminal at all: find the file in your file browser and double click away. Just do it after you update the repo and before you compile/install (without -e, that is. if you use -e you can apply the edit without reinstalling.)
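if you want the backup-before-editing insurance mentioned above, something like this works (run from inside the stable-fast checkout, which we cd into next; the .bak name is just a habit of mine):
$ cp src/sfast/libs/xformers/xformers_attention.py ~/xformers_attention.py.bak ## stash a pristine copy somewhere safe
$ git restore src/sfast/libs/xformers/xformers_attention.py ## the undo button: throws away your edits, brings back the original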
steps 2 and 3:
$ cd ./stable-fast/
$ git pull --recurse-submodules # the recursive stuff is about the other git repos this one depends on - you can see them in third-party. this probably won't do anything
$ nano src/sfast/libs/xformers/xformers_attention.py ##MAKE THE EDITS AT THIS POINT
$ TORCH_CUDA_ARCH_LIST=8.6 pip install --no-build-isolation -vvve . ## more typically -v or -ve or -e. or alternatively, the bare minimum, from the env root:
$ MAX_JOBS=1 TORCH_CUDA_ARCH_LIST= pip install -v repositories/stable-fast/ #slow, reliable. but go nuts and set it to 2 for half the wait.
use the -e switch and, instead of a wheel, pip will install a link to your git tree, so edits you make to the source (as long as it's not the C or CUDA) will apply to the environment without having to run install again. That being said: install ccache (dark magic, be warned) and use --no-build-isolation (highly recommended) to make rebuilds quick anyway.
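to see whether ccache is actually earning its keep - this assumes your distro wires ccache into the compiler search path (on Arch, put /usr/lib/ccache/bin at the front of PATH):
$ ccache --zero-stats ## reset the counters
$ TORCH_CUDA_ARCH_LIST=8.6 pip install --no-build-isolation -ve . ## rebuild as usual
$ ccache --show-stats ## a pile of cache hits on the second build = the dark magic is working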
If you have an Ampere 3xxx (3090) series or Axx (e.g. A40) series card with a GAXXX (GA102) GPU: you are done. If you ran the MAX_JOBS=1 version: you are done, no matter what GPU you have. Otherwise I lied and you need to change or unset TORCH_CUDA_ARCH_LIST.
TORCH_CUDA_ARCH_LIST="" is the most reliable choice if you survive compilation. Read on.
summary of troubleshooting:
$ TORCH_CUDA_ARCH_LIST= MAX_JOBS=1 pip install --no-build-isolation -vvv repositories/stable-fast/
try without the build isolation flag if you like.
(add up to 3 -v[erbose] switches to make the installer output more useful. -v seems fine. You will see errors about using deprecated build methods - ignore them, you aren't calling them directly; verbose is just revealing what pip install does. they'd be hidden without verbose. I think.)
i'm running out of memory/freezing trying to compile (pip install from a repo). OR: I'm getting an error trying to run stable-fast, something about CUDA architecture or codegen or similar (I have never seen this myself, sorry!)
You need to constrain memory usage with MAX_JOBS= (the default seems to be the number of threads your CPU offers, probably 2x your cores) and by reducing the number of TORCH_CUDA_ARCH_LIST= entries you're compiling for. The former cuts memory usage at the expense of speed (of the build, not of sfast). the latter lets you compile for just one specific GPU type, at the risk of being wrong and at the benefit of both speed and memory (the LIST is multiplied per JOB).
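a middle-ground guess combining both knobs (my numbers, not gospel - bump MAX_JOBS back up if your RAM survives):
$ nproc ## roughly what MAX_JOBS defaults to
$ MAX_JOBS=$(($(nproc)/2)) TORCH_CUDA_ARCH_LIST=8.6 pip install --no-build-isolation -v repositories/stable-fast/ ## half the parallelism, one arch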
or TLDR: https://developer.nvidia.com/cuda-gpus and maybe add +PTX if your GPU might be too cool for me to comprehend.
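or just ask the card itself instead of guessing (the python one works anywhere torch can see the GPU; the nvidia-smi query needs a reasonably recent driver):
$ python -c 'import torch; print(torch.cuda.get_device_capability())' ## (8, 6) means TORCH_CUDA_ARCH_LIST=8.6
$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader ## same answer straight from the driver, e.g. 8.6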
Replace ...LIST=8.6 with 8.0+PTX if you're not using an Ampere or newer architecture gpu (GA1XX, A/H100, RTX 3XXX). careful though: PTX only carries you forward, so a pre-Ampere card needs its own, lower number from the nvidia list instead.
You do not have arch[itecture] 8.7 if you don't know what a "Jetson" is.
Not convinced that there's much difference, but if you have an H100 specifically, you should give it to me. use 9.0.
if that still doesn't work, consult the list linked above or skip to the next paragraph. It's very suspicious that the 4090 has an API level (8.9) less than Miss Grace Hopper (9.0), despite being newer. Also a 3090 can run all of the headline hopper features (accelerated FP8, terabit etc). You are not missing out.
paranoiacs such as myself use 8.6+PTX. PTX adds suboptimal forward compatibility (it's a higher-level bytecode that can be compiled very quickly just as it's needed).
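torch accepts a semicolon- (or space-) separated list, so you can have real binaries and the PTX safety net at once. for example:
$ TORCH_CUDA_ARCH_LIST="8.6;8.9+PTX" MAX_JOBS=1 pip install --no-build-isolation -v repositories/stable-fast/ ## native code for 8.6 and 8.9, plus PTX that anything newer can JIT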
TORCH_CUDA_ARCH_LIST="" (or just being sure it hasn't been set) selects every number. if you have plenty of RAM or just a smart swap space setup (compiling isn't very swappy tbh) this may just work. if it does, don't bother fiddling because you definitely have the right binaries built in there, somewhere.