Lora-Script — Summary: these notes record the problems I ran into while fine-tuning models, along with the parameter configurations I used. The GPUs are rented cloud instances.
Multi-GPU training uses a modified kohya_ss setup with DeepSpeed (ZeRO stage 3) for multi-card training.
Single-GPU training uses aki's toolkit to run lora-scripts.
Training config (TOML):

```toml
[model]
v2 = false
v_parameterization = false
pretrained_model_name_or_path = "./sd-models/realismEngineSDXL_v30VAE.safetensors"
vae = "./sd-models/sdxl_vae.safetensors"

[dataset]
train_data_dir = "./train/001"
reg_data_dir = ""
prior_loss_weight = 1
cache_latents = true
shuffle_caption = true
enable_bucket = true

[additional_network]
network_dim = 32
network_alpha = 16
network_train_unet_only = false
network_train_text_encoder_only = false
network_module = "networks.lora"
network_args = []

[optimizer]
unet_lr = 1e-4
text_encoder_lr = 1e-5
optimizer_type = "AdamW8bit"
lr_scheduler = "cosine_with_restarts"
lr_warmup_steps = 0
lr_restart_cycles = 1

[training]
resolution = "512,512"
train_batch_size = 1
max_train_epochs = 10
noise_offset = 0.0
keep_tokens = 0
xformers = true
lowram = false
clip_skip = 2
mixed_precision = "fp16"
save_precision = "fp16"

[sample_prompt]
sample_sampler = "euler_a"
sample_every_n_epochs = 1

[saving]
output_name = "xtgz-centos-sdxl"
save_every_n_epochs = 2
save_n_epoch_ratio = 0
save_last_n_epochs = 499
save_state = false
save_model_as = "safetensors"
output_dir = "./output"
logging_dir = "./logs"
log_prefix = "output_name"

[others]
min_bucket_reso = 256
max_bucket_reso = 1024
caption_extension = ".txt"
max_token_length = 225
seed = 1337
```
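Before launching, a quick sanity check that the paths referenced in this config actually exist can save a failed run (my own addition; the paths come straight from the TOML above):

```bash
# Verify the model/VAE files and the training image folder referenced in the config.
ls -lh ./sd-models/realismEngineSDXL_v30VAE.safetensors ./sd-models/sdxl_vae.safetensors
ls ./train/001 | head
```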
Common issues
Mirror site URL
Missing models
pip permission issues
Folder permission issues
WARNING: Ignoring invalid distribution -orch (/root/miniconda3/lib/python3.10/site-packages) — delete the leftover folder (named something like `~orch`) or any similarly named `~...` folder under site-packages.
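A minimal cleanup sketch (the exact leftover name may differ; `~orch` here matches the warning above):

```bash
# List any leftover "~xxx" folders that trigger the warning, then remove the one
# reported. Verify the folder name before deleting; "~orch" is an assumption based
# on the warning text above.
ls -d /root/miniconda3/lib/python3.10/site-packages/~*
rm -rf "/root/miniconda3/lib/python3.10/site-packages/~orch"
```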
torchvision issues
CUDA 12.8 installation - Download: `wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run`
- Silent install: `sh cuda_12.8.0_570.86.10_linux.run --toolkit --toolkitpath=/root/autodl-tmp/cuda-12.8 --silent`
- Update the environment variables (a quick verification sketch follows this list)
- `echo 'export PATH=/root/autodl-tmp/cuda-12.8/bin:$PATH' >> ~/.bashrc`
- `echo 'export LD_LIBRARY_PATH=/root/autodl-tmp/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc`
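A quick check, using standard commands (not part of the original notes), that the new toolkit is the one actually picked up:

```bash
source ~/.bashrc
which nvcc        # should point into /root/autodl-tmp/cuda-12.8/bin
nvcc --version    # should report release 12.8
nvidia-smi        # driver-side CUDA version, for comparison with nvcc
```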
CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device Traceback (most recent call last):
To fix this, first check that the CUDA version reported by `nvcc --version` matches the one shown by `nvidia-smi`.
If they do not match, install a newer CUDA toolkit using the steps above.
Run `conda list` to check whether the PyTorch version matches what xformers supports. If, for example, torch 2.7.1 is installed but xformers only supports up to 2.7.0, first downgrade: `pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128`.
If the error persists, the problem really is xformers.
Run `python -m xformers.info`.
Check the `build.env` values in its output; the prebuilt wheel may only be compiled for compute capabilities up to 9.0, which does not cover a compute-capability-12 GPU:
`build.env.TORCH_CUDA_ARCH_LIST: 6.0+PTX 7.0 7.5 8.0+PTX 9.0a`
Confirm your GPU's compute capability with `nvidia-smi --query-gpu=compute_cap --format=csv` (here: 12.0).
Set the arch list for the current shell: `export TORCH_CUDA_ARCH_LIST="12.0"` (one-off, current session only; a way to persist it is sketched below).
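Since `export` only affects the current shell, the value can be persisted the same way as the CUDA paths earlier if you rebuild often (my suggestion, not part of the original steps):

```bash
echo 'export TORCH_CUDA_ARCH_LIST="12.0"' >> ~/.bashrc
source ~/.bashrc
```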
Download the source and build it; a mirror can speed up the clone: `pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers`, or via the mirror: `pip install -v --no-build-isolation -U git+https://ghfast.top/https://github.com/facebookresearch/xformers.git@main#egg=xformers`.
After the build finishes, check the xformers build info again; if it now matches your GPU's compute capability, the build is correct.
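A quick side-by-side check using the commands already mentioned above:

```bash
nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # e.g. 12.0
python -m xformers.info | grep -i torch_cuda_arch_list     # should now include 12.0
```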
- Running install-cn.ps1 directly can fail to create the virtual environment; switch to install.ps1 instead. The network problems mainly affect the torch install, and the CN variant does not noticeably improve this.
- Installation
- If the network is too slow and torch has to be reinstalled, follow the steps below
- Open install.ps1 and run the commands below by hand
- Create the virtual environment: `python.exe -m venv venv`
- Activate it: `.\venv\Scripts\activate`
- Use `nvidia-smi` to find your CUDA version, then download the matching `.whl` manually from the [PyTorch site](https://pytorch.org/get-started/locally/)
- ![[Pasted image 20251117091728.png]]
- On the xformers side, take the install command's URL into the browser and download the wheel manually; I am on CUDA 12.8, so I use the cu128 command
- `pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu128`
- Put the two downloaded wheels into the lora-scripts folder ![[Pasted image 20251117092155.png]]
- Install the two packages manually with pip: torch first, then xformers
- ` pip install .\torch-2.7.0+cu128-cp310-cp310-win_amd64.whl`
- `pip install .\xformers-0.0.30-cp310-cp310-win_amd64.whl`
- Then update the environment file in the PowerShell script, and everything runs; a quick verification sketch follows. ![[Pasted image 20251117092327.png]]
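A one-line check inside the activated venv that torch sees the GPU and xformers imports cleanly (my own verification, not part of install.ps1):

```bash
python -c "import torch, xformers; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), xformers.__version__)"
```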
Flux training raises an error asking to download google/t5-xxl — see [https://blog.csdn.net/sinat_29957455/article/details/142782264](https://blog.csdn.net/sinat_29957455/article/details/142782264)
Multi-GPU training — collected issues

Full training parameters (tran_flux.sh; a launch sketch follows the config files below):

```bash
export NCCL_IB_DISABLE=1
export CUDA_LAUNCH_BLOCKING=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

MODEL_PATH="train/sd-models/flux1-dev.safetensors"
CLIP_PATH="train/sd-models/clip_l.safetensors"
T5_PATH="train/sd-models/t5xxl_fp16.safetensors"
AE_PATH="train/sd-models/flux-ae.safetensors"
OUTPUT_DIR="./output"

accelerate launch \
  --deepspeed_config_file "ds_config.json" \
  --use_deepspeed \
  --num_cpu_threads_per_process 8 \
  --gpu_ids 0,1,2,3 \
  --mixed_precision bf16 \
  --num_processes 4 \
  --num_machines 1 \
  --num_cpu_threads_per_process 1 \
  --offload_optimizer_device cpu \
  --offload_param_device cpu \
  "sd-scripts/flux_train.py" \
  --config_file "dreambooth_flux_config.toml" \
  --optimizer_type="adafactor" \
  --optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
  --cache_text_encoder_outputs \
  --cache_latents \
  --full_bf16 \
  --lowram \
  --gradient_checkpointing \
  --cache_latents \
  --max_data_loader_n_workers 0 \
  --learning_rate 1e-5 \
  --cache_latents_to_disk \
  --cache_text_encoder_outputs_to_disk
```
ds_config.json:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "fast_init": false
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/root/autodl-tmp/ds_cache",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e7,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_clipping": 1.0,
  "bf16": {
    "enabled": true
  }
}
```
The DreamBooth TOML passed via `--config_file`:

```toml
ae = "train/sd-models/flux-ae.safetensors"
blocks_to_swap = 0
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_dropout_every_n_epochs = 0
caption_dropout_rate = 0
caption_extension = ".txt"
clip_l = "/root/autodl-tmp/kohya_ss/train/sd-models/clip_l.safetensors"
cpu_offload_checkpointing = true
discrete_flow_shift = 3.1582
double_blocks_to_swap = 0
dynamo_backend = "no"
epoch = 50
fp8_base = true
full_bf16 = false
gradient_checkpointing = true
guidance_scale = 1
huber_c = 0.1
huber_scale = 1
huber_schedule = "snr"
keep_tokens = 0
learning_rate = 4e-6
learning_rate_te = 0
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 0
max_bucket_reso = 1024
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 500
max_token_length = 75
max_train_steps = 250
min_bucket_reso = 256
mixed_precision = "bf16"
model_prediction_type = "sigma_scaled"
multires_noise_discount = 0.3
no_token_padding = true
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False", "weight_decay=0.01",]
optimizer_type = "Adafactor"
output_dir = "outputs"
output_name = "Quality_1"
persistent_data_loader_workers = 0
pretrained_model_name_or_path = "train/sd-models/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_sampler = "euler_a"
save_every_n_epochs = 10
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
seed = 1
single_blocks_to_swap = 0
t5xxl = "train/sd-models/t5xxl_fp16.safetensors"
t5xxl_max_token_length = 225
timestep_sampling = "sigmoid"
train_batch_size = 1
train_blocks = "all"
train_data_dir = "train/images"
wandb_run_name = "Quality_1"
```
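With the three files above in place (the script, ds_config.json, and the DreamBooth TOML), launching is just running the script; piping through tee to keep a log is my addition:

```bash
chmod +x tran_flux.sh
bash tran_flux.sh 2>&1 | tee train_$(date +%Y%m%d_%H%M).log
```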
1. OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the server_port parameter to launch()

```
Traceback (most recent call last):
  File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\launch.py", line 102, in <module>
    interface.main()
  File "E:\lora training\lora-scripts-v1.8.5\mikazuki\dataset-tag-editor\scripts\interface.py", line 218, in main
    app, _, _ = interface.launch(
  File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\blocks.py", line 1907, in launch
    ) = networking.start_server(
  File "E:\lora training\lora-scripts-v1.8.5\venv\lib\site-packages\gradio\networking.py", line 207, in start_server
    raise OSError
OSError: Cannot find empty port in range: 28001-28001. You can specify a different port by setting the GRADIO_SERVER_PORT environment variable or passing the `server_port` parameter to `launch()`.
```
- Run:
- `netstat -ano | findstr :28001` to find the PID occupying the port
- `taskkill /PID 12345 /F` to kill it (replace 12345 with the PID found)
- Alternatively, as the error message itself suggests, set GRADIO_SERVER_PORT to a free port.
2. torch.OutOfMemoryError: CUDA out of memory.
The error reports GPU out-of-memory, but the real cause may be system RAM being exhausted; reduce the batch_size (and check which one is actually running out, as sketched below).
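To see whether system RAM or VRAM is the one actually filling up, watching both during the first steps helps (a monitoring sketch, not from the original notes):

```bash
watch -n 2 'free -h; nvidia-smi --query-gpu=memory.used,memory.total --format=csv'
```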
3. use_libuv = 0
Reference: Introduction to Libuv TCPStore Backend
Route 3 there points out that if use_libuv=0 is set via the environment but the code passes use_libuv=True, the value in the code still wins; so I ended up hard-coding use_libuv=False in every file that raised the error.
4. ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [1], output_device 1, and module parameters {device(type='cpu')}
Add a new parameter so blocks are not swapped to the CPU: `blocks_to_swap = 0`. Paths used in this setup (models and training folder):
- E:/lora training/lora-scripts-v1.8.5/sd-models/flux-ae.safetensors
- E:/lora training/lora-scripts-v1.8.5/sd-models/clip_l.safetensors
- E:/lora training/lora-scripts-v1.8.5/sd-models/t5xxl_fp16.safetensors
- E:/Kohya_FLUX_DreamBooth_v18/kohya_ss/train
- E:/lora training/lora-scripts-v1.8.5/sd-models/flux1-dev.safetensors
This is caused by the model tensors not being initialized; modify the following file: /root/autodl-tmp/kohya_ss/sd-scripts/library/flux_utils.py
6. DeepSpeed OOM — caused by DeepSpeed loading data into both VRAM and system RAM when training Flux. Training task: full fine-tuning of the Flux1-dev model. Environment:
```
PyTorch 2.7.0
Python 3.12 (ubuntu22.04)
CUDA 12.8
GPU: RTX 5090 (32GB) x 4
CPU: 64 vCPU Intel(R) Xeon(R) Gold 6459C
RAM: 360GB
Disk: system 30 GB; data 50GB SSD (free) + 440GB (paid)
```
Solution: when VRAM or RAM is insufficient, use DeepSpeed's NVMe offload to store the data on the data disk, shifting all of the memory pressure onto the drive.
The ds_config.json used is the one shown above: both offload_optimizer and offload_param are set to device "nvme", with nvme_path pointing at /root/autodl-tmp/ds_cache.
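Before enabling NVMe offload it is worth making sure the offload directory exists and the data disk has enough free space, since ZeRO-3 writes optimizer and parameter shards there (a sketch; the path is the nvme_path from ds_config.json):

```bash
mkdir -p /root/autodl-tmp/ds_cache
df -h /root/autodl-tmp   # the 440GB data disk should be mounted here with plenty of room
```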
7. TypeError: adam_update(): incompatible function arguments. With DeepSpeed ZeRO stage 3, the eps value passed into this function has the wrong type.
Modify the code where `ds_opt_adam.adam_update` is called:
```python
beta1, beta2 = group['betas']

# eps can arrive as a tuple/list from the config; the C++ kernel expects a float
eps_val = group['eps']
if isinstance(eps_val, (tuple, list)):
    eps_val = 1e-8
else:
    eps_val = float(eps_val)

# step may be a tensor; normalize it to a plain int
step_val = state['step']
if hasattr(step_val, 'item'):
    step_val = int(step_val.item())
else:
    step_val = int(step_val)

bias_correction_val = bool(group['bias_correction'])

# Debug print of every argument passed to adam_update
print("\n" + "=" * 30 + " DEBUG ADAM UPDATE " + "=" * 30)
try:
    arg_list = [
        ("0. opt_id (int)", self.opt_id),
        ("1. step (int)", state['step']),
        ("2. lr (float)", group['lr']),
        ("3. beta1 (float)", beta1),
        ("4. beta2 (float)", beta2),
        ("5. eps (float)", group['eps']),
        ("6. weight_decay (float)", group['weight_decay']),
        ("7. bias_correction (bool)", group['bias_correction']),
        ("8. param (Tensor)", p.data),
        ("9. grad (Tensor)", p.grad.data),
        ("10. exp_avg (Tensor)", state['exp_avg']),
        ("11. exp_avg_sq (Tensor)", state['exp_avg_sq']),
    ]
    for name, val in arg_list:
        if hasattr(val, 'shape'):
            print(f"[{name}]: Type={type(val)}, Dtype={val.dtype}, Device={val.device}, Shape={val.shape}")
        else:
            print(f"[{name}]: Type={type(val)}, Value={val}")
except Exception as e:
    print(f"DEBUG ERROR: {e}")
print("=" * 80 + "\n")

self.ds_opt_adam.adam_update(self.opt_id, state['step'], group['lr'], beta1, beta2,
                             eps_val, group['weight_decay'], bias_correction_val,
                             p.data, p.grad.data, state['exp_avg'], state['exp_avg_sq'])
return loss
```
Problem 0: `****** , No Data !` — /root/miniconda3/envs/kohyass/lib/python3.11/site-packages/transformers/modeling_utils.py
Line 2031: add the `enabled` parameter:
`init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config(), enabled=False), set_zero3_state()]`
Reason: when loading a locally downloaded model, transformers uses meta tensors, but importing the checkpoint then finds the meta tensors empty and raises this error, so DeepSpeed's default initialization is disabled here.
https://github.com/zai-org/ChatGLM-6B/issues/530
`mat1 and mat2 not equal ….` — kohya_ss/sd-scripts/library/flux_models.py
Line 1068: cast the txt and img tensors, because CLIP and T5 outputs are float32 while the other tensors are set to bfloat16, which causes the mismatch:
```python
def forward(
    self,
    img: Tensor,
    img_ids: Tensor,
    txt: Tensor,
    txt_ids: Tensor,
    timesteps: Tensor,
    y: Tensor,
    block_controlnet_hidden_states=None,
    block_controlnet_single_hidden_states=None,
    guidance: Tensor | None = None,
    txt_attention_mask: Tensor | None = None,
) -> Tensor:
    target_dtype = self.img_in.weight.dtype  # use this layer's weight dtype as the reference
    if img.dtype != target_dtype:
        img = img.to(target_dtype)
    if txt.dtype != target_dtype:
        txt = txt.to(target_dtype)
    if timesteps.dtype != target_dtype:
        timesteps = timesteps.to(target_dtype)
    if guidance is not None and guidance.dtype != target_dtype:
        guidance = guidance.to(target_dtype)
    if y is not None and y.dtype != target_dtype:
        y = y.to(target_dtype)

    if img.ndim != 3 or txt.ndim != 3:
        raise ValueError("Input img and txt tensors must have 3 dimensions.")
    # ========================== the rest of the original forward follows
```
DeepSpeed AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'backward' — this happens because DeepSpeed was not initialized; add a print at the location below to check.
/root/autodl-tmp/kohya_ss/sd-scripts/library/deepspeed_utils.py Line 87
kohya_ss/sd-scripts/library/deepspeed_utils.py Line 64: when DeepSpeed is not configured, this branch simply skips and returns None.
NCCL enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered' — fix: `pip install "nvidia-nccl-cu12>2.26.2"`. This warning appears on the 5090 and does not affect training.
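Note that the `>` needs quoting, otherwise the shell treats it as output redirection; a sketch of the upgrade plus a version check:

```bash
pip install "nvidia-nccl-cu12>2.26.2"
pip show nvidia-nccl-cu12 | grep -i version
```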