Redline πŸŽ₯ Wan2.1-T2V-14B

Verified: SafeTensor
Type: LoRA
Published: Apr 27, 2025
Base Model: Wan Video
Training: 16,617 steps, 29 epochs
Usage Tips: Strength 1
Trigger Words: Redline style
Training Images: available for download
Hash (AutoV2): 5B14A4EF9B
seruva19

Description

Redline is a 2009 anime film directed by Takeshi Koike and produced by Madhouse. The story centers on JP, a racer with a reckless spirit, and Sonoshee McLaren, his rival and eventual ally, as they compete in Redline, the most dangerous and prestigious race in the galaxy. The film is notable for its entirely hand-drawn animation, a production effort that spanned seven years. Its intense, high-energy style and detailed visuals draw comparisons to earlier landmark works like Akira, reflecting a similar commitment to technical excellence and dynamic storytelling.

This is one of my favorite animated films of all time, and this is my first attempt to reconstruct its animation style. The other goals behind creating this LoRA were:

- optimize the training pipeline, in particular so that training a single LoRA does not take 90 hours πŸ˜‰

- research and expand the motion capabilities of Wan

I spent a lot of time (probably too much) trying to solve the second task but did not get far (which was expected; one small LoRA can't do everything). Although the art style of Redline transferred fairly well, camera movement, angles, fast rapid motion, etc. were not adopted fully, or at least I did not achieve the accuracy I wanted. Overall I trained 3 variants of the LoRA and spent more than 80 hours on them in total. In the end I decided to stop after the 3rd iteration, because the inability to reach perfection started to frustrate me too much, and I thought it would be better to revisit this model later than to keep retraining it endlessly without a clear understanding of what was wrong.

Usage

The LoRA was trained with the words "Redline style" at the beginning of each caption. I also used the term "kinetic-deformed" (and captioned the relevant scenes with it) to emphasize the legendary Redline acceleration effect. Most likely this term does not actually affect the scene, because there were only 3 scenes in the dataset involving this effect, so it might just be a placebo; but since it sounds badass, I usually include it in prompts related to fast driving motion.

This LoRA is highly prompt-dependent, and I am still working on finding the best prompt template to unleash its full potential. Below is a template you can use to generate prompts that should produce moderately accurate outputs (replace the last line with any topic of your choice):

You are an advanced prompt generator for AI video generation models. Your goal is to create vivid, cinematic, highly detailed prompts for generating video clips in the style of the Redline animation movie.

Prompt Rules:
- Every prompt must begin with: "Redline style."
- Use clear, simple, direct, and concise language. No metaphors, exaggerations, figurative language, or subjective qualifiers (e.g., no "fierce", "breathtaking").
- Prompt length: 80–100 words.
- Follow this structure: Scene + Subject + Action + Composition + Camera Motion

1. Scene (environment description)
Establish environment type: urban, natural, surreal, etc. Include time of day, weather, visible background events or atmosphere. Only describe what is seen; no opinions or emotions.

2. Subject (detailed description)
Describe only physical traits, appearance, outfit. Use vivid but minimal adjectives (no occupations like "biker", "mechanic", etc.) No excessive or flowery detail.

3. Action (subject and environment movement)
Specify only one clear subject and/or environmental interaction. Describe only what can be seen in 5 seconds.

4. Composition and Perspective (framing)
Choose from: Close-up | Medium shot | Wide shot | Low angle | High angle | Overhead | First-person | FPV | Bird’s-eye | Profile | Extreme long shot | Aerial

5. Motion (cinematic movement)
Use: Dolly in | Dolly out | Zoom-in | Zoom-out | Tilt-up | Tilt-down | Pan left | Pan right | Follow | Rotate 180 | Rotate 360 | Pull-back | Push-in | Descend | Ascend | 360 Orbit | Hyperlapse | Crane Over | Crane Under | Levitate
Describe clearly how the camera moves and what it captures. Focus on lighting, mood, particle effects (like dust, neon reflections, rain), color palette if needed. Be visually descriptive, not emotional. Keep each motion or camera movement concise β€” each representing about 5 seconds of video. Maintain a strong visual "Redline" animation aesthetic: bold, vibrant, energetic, fluid animation feeling.

Use simple prompts, like you're instructing a 5-year old artist.  

Now, make 10 close up shots of expressive and dangerous women in universe of Redline.

Some feature bleed can be observed (for example, women can sometimes get JP's pompadour hairstyle), but I accepted this from the beginning, since I wanted to make a style LoRA, not a character one. (Also, I realized that a woman with a pompadour looks extremely badass πŸ˜™.) Most of the time, if you prompt for a man or a woman without describing specific appearance traits, expect JP or Sonoshee to appear.

Workflows are embedded within each mp4. Here is an example workflow in JSON: https://files.catbox.moe/31mpay.json.

As before, I used plenty of optimizations (including TeaCache) to be able to render a 640x480x81 clip in ~5 minutes on an RTX 3090. In my opinion, TeaCache does not hurt motion as badly as is commonly believed. (I am talking about animation only; I never generated realistic videos with Wan, so I cannot speak to that.) Yes, it makes quality a bit worse, but if a clip is bad with TeaCache, turning it off does not make it good. Fast transitions and rapid motion are still very hit-or-miss with or without TeaCache.

Compatibility with other LoRAs and with I2V checkpoints was not tested.

Training

Mostly I reused the routine from my previous LoRA, i.e. mixed training on images and videos with different resolution and duration buckets. I used musubi-tuner (Windows 11, RTX 3090, 64 GB RAM). I optimized and refined my training pipeline, adopting some practices from other creators (in particular, I would like to thank blyss for her detailed insights and blip for their useful tips on speeding up training). Compared to the training pipeline of my previous LoRA, training was now almost 3x faster, so one iteration took ~5 s (instead of 12-13 s) on the RTX 3090. With the new parameters I could have trained my previous LoRA in 30 hours instead of 90. Nice.

(I uploaded all training data and training configs alongside the LoRA, so you can check them if you'd like.)

Overall, the most notable changes were:

  • fp16 checkpoint (instead of bf16) + fp8_base + fp8_scaled

  • no block swapping (thanks to optimized dataset structure, see later πŸ‘‡)

  • CAME optimizer (instead of adamw8bit)

  • FlashAttention training acceleration (instead of sdpa)

  • using loraplus_lr_ratio=2 with a lower learning rate (3e-5 instead of 5e-5)

Regarding the dataset, the main change was that I split all videos into separate duration buckets (for more efficient use) and had to decrease the training resolution (to avoid the need for block swapping with 24 GB of VRAM). The overall procedure was:

  • get source film in highest (reasonably) possible quality - 1864x1048, H.265 17104 Kb/s

  • split it into fragments (with PySceneDetect; see the sketch after this list)

  • select appropriate fragments (with a custom primitive GUI for fast video selection and navigation) - 175 clips in total

  • convert them to 16fps, remove audio (ffmpeg)

  • extract keyframes that serve as the high-resolution image dataset (ffmpeg) - 170 images in total

  • split the videos into duration buckets - 28 folders (buckets) in total

  • generate dataset toml config file for musubi-tuner with parameters optimized for each duration bucket (see later πŸ‘‡)

  • generate captions for images (see later πŸ‘‡)

  • generate captions for videos (I used a "dual-captions" approach: a "short" version with an overall scene description and a "long" version with a detailed caption, see later πŸ‘‡)
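
For reference, here is a minimal sketch of the split / convert / keyframe steps from the list above, using the PySceneDetect Python API and ffmpeg via subprocess. The paths, filenames, and detector settings are illustrative placeholders, not my exact ones, and the manual clip-selection step is assumed to have already copied the chosen fragments into SELECTED_DIR.

# Sketch of the split / convert / keyframe steps (PySceneDetect + ffmpeg).
# Paths are placeholders; clip selection is assumed to be done by hand.
import subprocess
from pathlib import Path

from scenedetect import detect, ContentDetector, split_video_ffmpeg

SOURCE = "redline_1864x1048.mkv"                       # hypothetical source filename
SELECTED_DIR = Path("H:/datasets/redline/selected")    # hand-picked fragments
CLIPS_DIR = Path("H:/datasets/redline/clips_16fps")    # 16 fps, audio stripped
KEYFRAMES_DIR = Path("H:/datasets/redline/keyframes")  # high-res image dataset

# 1. Split the film into fragments at detected cuts.
scenes = detect(SOURCE, ContentDetector())
split_video_ffmpeg(SOURCE, scenes)  # writes "<name>-Scene-NNN.mp4" files by default

# 2. Convert each selected fragment to 16 fps and remove audio.
CLIPS_DIR.mkdir(parents=True, exist_ok=True)
for clip in sorted(SELECTED_DIR.glob("*.mp4")):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip), "-r", "16", "-an", str(CLIPS_DIR / clip.name)],
        check=True,
    )

# 3. Extract I-frames from each converted clip to build the image dataset.
KEYFRAMES_DIR.mkdir(parents=True, exist_ok=True)
for clip in sorted(CLIPS_DIR.glob("*.mp4")):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip),
         "-vf", "select=eq(pict_type\\,I)", "-vsync", "vfr",
         str(KEYFRAMES_DIR / f"{clip.stem}_%03d.png")],
        check=True,
    )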

I won't show the full dataset.toml file here (it is ~600 lines long), but the main idea was the same as in my previous LoRA: a "three-tier" dataset. Since this time I decided to fit everything into VRAM without block swapping (to maximize training speed), I had to lower the target training resolutions.

  • 1: high-resolution image dataset at 976x544 - the max I could afford without block swapping (on Linux or with diffusion-pipe this could probably be higher)

  • 2: medium-resolution video dataset with short frame duration, 512x288x17

  • 3: low-resolution video dataset with max frame duration, 256x144x81

(I also tried training with a "dual-tier" dataset: medium-res, short-duration videos plus high-res images, but it was not as effective.)

Here is the dataset record in the config used for the image dataset (170 images):

[[datasets]]
image_directory = "H:/datasets/redline/images/1864x1048x1"
cache_directory = "H:/datasets/redline/images/1864x1048x1/cache"
resolution = [976, 544]
batch_size = 1
num_repeats = 1
caption_extension = ".txt"

For captioning images I used Ovis2-16B (locally) in single-image input mode; the caption prompt was:

Describe this scene. Do not use the word "image". If there are people in the scene, explicitly state their gender. Begin the description with: "Redline style. "

For "second tier" dataset I used videos (175 clips) with lower (but not lowest resolution) of 512x288. Their actual duration was from 25 frames and higher, but target frame duration (target_frames) on config was always equal to [17], frame_extraction "head". For each of 28 duration buckets (folders) config section always looked the same (only 28 folder names changing):

[[datasets]]
video_directory = "H:/datasets/redline/videos/1864x1048x25"
cache_directory = "H:/datasets/redline/videos/1864x1048x25/cache_s"
resolution = [512, 288]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [17]
caption_extension = ".short"

For this dataset "short" caption files were used. For captioning this I also used Ovis2-16B, but in video input mode. Caption prompt was:

Describe this scene in brief. If there are people in the scene, explicitly state their gender. If they are talking or speaking, explicitly mention it. Describe subjects and their actions first, then describe background and environment. Begin the description with: "Redline style. "

"Third tier" dataset consisted of the same 175 videos, but with lowest training resolution 256x144, frame_extractions set to "uniform". For buckets (folders with videos) where number of frames exceeded 81, frame_samples was set to 2, target_frames to [81]. For buckets where number of frames was fewer than 81, frame_samples was set to 1, target_frames set to [X] where X is flattened (closest to evenly divisible by 4N+1) duration of the bucket (25, 29, 33...)

(OK, I get that all this may sound overly complex, but of course I didn't calculate it all by hand: I asked Claude to write an appropriate script, I just carefully formulated the requirements. This procedure was maybe redundant and didn't influence the training output as much as the source dataset itself did, because my previous (and best) LoRA worked well without these measures; but I prefer to control all buckets and the dataset structure manually, because it may help to identify weak points in the dataset after training. Well, in theory.)
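
To make the bucket rules concrete, here is a minimal sketch of the kind of per-bucket logic such a script implements (my own reconstruction, following the config excerpts below, not the actual script): derive the duration from the bucket folder name, cap target_frames at 81, round shorter buckets down to the nearest 4N+1 value, and emit the corresponding [[datasets]] section.

# Reconstruction of the per-bucket "third tier" rules; folder layout and
# emitted keys mirror the config excerpts shown below in this post.
from pathlib import Path

VIDEOS_ROOT = Path("H:/datasets/redline/videos")  # 28 bucket folders like "1864x1048x25"

def third_tier_section(bucket_dir: Path) -> str:
    # Bucket folder names end with the clip duration in frames, e.g. "...x97".
    frames = int(bucket_dir.name.rsplit("x", 1)[-1])
    if frames >= 81:
        frame_extraction = "uniform"  # sample 2 windows of 81 frames across the clip
        target_frames = 81
        frame_sample = 2
    else:
        frame_extraction = "head"     # take one window from the start of the clip
        target_frames = ((frames - 1) // 4) * 4 + 1  # round down to the nearest 4N+1
        frame_sample = 1
    lines = [
        "[[datasets]]",
        f'video_directory = "{bucket_dir.as_posix()}"',
        f'cache_directory = "{(bucket_dir / "cache_l").as_posix()}"',
        "resolution = [256, 144]",
        "batch_size = 1",
        "num_repeats = 1",
        f'frame_extraction = "{frame_extraction}"',
        f"target_frames = [{target_frames}]",
        'caption_extension = ".long"',
    ]
    if frame_sample > 1:
        lines.append(f"frame_sample = {frame_sample}")
    return "\n".join(lines)

if __name__ == "__main__":
    sections = [third_tier_section(d) for d in sorted(VIDEOS_ROOT.iterdir()) if d.is_dir()]
    Path("dataset_third_tier.toml").write_text("\n\n".join(sections) + "\n")

The "second tier" sections (512x288, target_frames [17], ".short" captions) are fixed per bucket, so they can be emitted the same way without the duration logic.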

Example for bucket folders with fragment durations < 81 frames:

[[datasets]]
video_directory = "H:/datasets/redline/videos/1864x1048x25"
cache_directory = "H:/datasets/redline/videos/1864x1048x25/cache_l"
resolution = [256, 144]
batch_size = 1
num_repeats = 1
frame_extraction = "head"
target_frames = [25]
caption_extension = ".long"

Example for bucket folders with fragment durations >= 81 frames:

[[datasets]]
video_directory = "H:/datasets/redline/videos/1864x1048x97"
cache_directory = "H:/datasets/redline/videos/1864x1048x97/cache_l"
resolution = [256, 144]
batch_size = 1
num_repeats = 1
frame_extraction = "uniform"
target_frames = [81]
caption_extension = ".long"
frame_sample = 2

For this dataset, "long" caption files were used. They were also generated with Ovis2-16B in video input mode and a following caption prompt (note that, unlike with the short captions, this time I also instructed it to caption the background first):

Describe this scene in details. If there are people in the scene, explicitly state their gender. If they are talking or speaking, explicitly mention it. Describe background and environment first, then describe subjects and their actions. Begin the description with: "Redline style. "

Regarding "dual-captioning", I expected it to be another measure against potential overfitting, since the same fragment had to be learned by the model through the lens of two different captions, effectively serving as a form of "caption augmentation." (I actually found this idea in Seaweed-7B paper and decided to adapt it.)

Training was carried out for 50 epochs (573 steps per epoch), but subsequent tests showed that the LoRA at step 16,617 (573 × 29, i.e. epoch 29) was the most stable and versatile (so, in the end, the actual training took ~23 hours). By the way, the LoRA at epoch 10 (5,730 steps) was already able to render the art style of Redline pretty well, but the motion was entirely inherited from the base Wan model, which I could not accept.

I mentioned this is the third version of the LoRA. The first one was poorly captioned (I used minimalistic captions made by Gemini; quality-wise these captions were good, but I still did not like the training result, and I believe Wan simply does not like short captions). The second one used the "dual-tier" dataset; it turned out OK, but not OK enough. The third one is this one. It's still not as good as I wish it were, but, as I said, I felt I had to stop and take a break to preserve my sanity. This LoRA could be better, but it could also be worse. I still have some ideas for improving the quality of LoRA training, which I plan to test in upcoming versions.