Last month I set out to create a way for my friends to make custom civilization sprites for our Age of Empires II lobbies.
Below are some thoughts on the process of creating a versatile prompt-based image generator. For beginners I would recommend Alpaca, and for those comfortable with coding, Stable Diffusion Web UI and Python.
Special thanks to the AoEII modding communities OpenAge, SLX Studio, and Age of Kings Heaven.
If you’re interested in updates to this project you can follow me at @neilsonks.
Visual Explorations
Screenshot Transfers
This idea came about while playing with image transfer (img2img) on Age of Empires II screenshots. All the techniques described below use the Stable Diffusion generative “AI” (which I also refer to as the model or network).
I started in Photoshop with Alpaca and worked on full game screenshots. Mostly I was looking at how the model behaved: what shapes and textures it chose to preserve and what it discarded in its stylistic wanderings. Overall it managed to stay in perspective and theme quite well.
What was immediately interesting was how elements of the image started working together. Tiled terrain gave way to roads and paths – there is communication between the buildings and environment. For now we will only be working with building sprites – but it demonstrates that machine learning could be a good tool for blending procedural elements of a game.
The Caspar David Friedrich Benchmark
Following this, I prompted the network to output images in a strongly isometric perspective, looking to get consistent results across lighting, color, shape, and texture.
Some of the results were really awesome: as the strength of the image transfer increased, parts of the minimap and UI would become other buildings or terrain.
Sprite-Diffuser
Anime Loyalists vs. Moon Colonists vs. Zombie Romans
Now it was time to make some civilizations.
When generating, the model requires 512×512 images. Luckily, all the Age of Empires II sprites are about half that size, so we don’t have to worry about resizing. These were the first results.
These buildings were cherry-picked and cleaned in Photoshop; they weren’t batched. In order to get diverse results the strength of the image transfer was set very high, but as a result the lighting, proportion, and level of detail varied greatly. The castle asset (left) would continue to be a pain point: it is 2x larger than most buildings, but the network would insist it was an oversized house.
Refining Outputs with Control Net and Loopbacks
To get consistent results while having a high transfer strength I needed to use Control Net, a system that guides generation using input like lineart, depth, or segmentation.
At this point everything worked: sprites stayed in perspective and followed the prompt, but they failed to be imaginative or unexpected. This can’t be solved by making the prompt strength extremely high, because the lighting and color start to vary too much, and Control Net cannot help there.
Instead, there is a technique called loopback – this runs the generated image back over itself with the same prompt. Without it, the network isn’t able to imagine novel changes to shapes, textures, and color.
The final technique I settled on was Control Net with Zoe Depth Estimation, two loopbacks, both with about 85% strength (only 15% of the original image is preserved).
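The loopback idea can be sketched independently of any particular Stable Diffusion front end. The sketch below is a minimal illustration, not the exact setup used here: `generate` is a hypothetical callable standing in for one img2img pass (for example, a call into Web UI or a Python pipeline with Control Net depth guidance), and treating "two loopbacks" as two passes in total is an assumption.

```python
from typing import Callable, TypeVar

Image = TypeVar("Image")


def loopback(
    image: Image,
    prompt: str,
    generate: Callable[[Image, str, float], Image],
    passes: int = 2,
    strength: float = 0.85,
) -> Image:
    """Run img2img repeatedly, feeding each output back in as the next input.

    `generate` is a hypothetical stand-in for a single img2img call; the
    prompt and strength stay fixed across passes.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    for _ in range(passes):
        # The previous result becomes the new source image.
        image = generate(image, prompt, strength)
    return image
```

Keeping the generation step behind a callable means the same loop works whether the pass is an HTTP request to a local UI or a direct library call.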
Prompting
This is arguably the most important part of the process, but I have left it for last. That is because copying what worked for me will probably hold you back: I didn’t spend that much time on prompting! But this is a good place to start.
I used a simple formula for each image, with the following prompt fields. The only field that changed for each building was the Subject (e.g. Archery Range, Market, House).
- Shape: Isometric exterior of a
- Descriptor: ancient Roman
- Subject: Barracks
- Style: in the style of Giovanni Paolo Panini
- Emphasis: 3D roman architecture, greco-roman stone and pillars with intricate stonework and roofs
- Modifiers: desaturated, 8k, bright sunny natural lighting, trending on artstation
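The formula above can be sketched as a small template where only the Subject changes per building. The class name `PromptTemplate`, the `render` method, and the way the fields are joined (spaces for the opening phrase, commas for the rest) are assumptions for illustration; the field values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """One civilization's prompt; only the subject varies per sprite."""

    shape: str
    descriptor: str
    style: str
    emphasis: str
    modifiers: str

    def render(self, subject: str) -> str:
        # Assumed joining: the first four fields read as one phrase,
        # followed by comma-separated emphasis and modifiers.
        head = f"{self.shape} {self.descriptor} {subject} {self.style}"
        return ", ".join([head, self.emphasis, self.modifiers])


ROMAN = PromptTemplate(
    shape="Isometric exterior of a",
    descriptor="ancient Roman",
    style="in the style of Giovanni Paolo Panini",
    emphasis=(
        "3D roman architecture, greco-roman stone and pillars "
        "with intricate stonework and roofs"
    ),
    modifiers="desaturated, 8k, bright sunny natural lighting, trending on artstation",
)

# One prompt per building type.
prompts = {s: ROMAN.render(s) for s in ["Barracks", "Archery Range", "Market", "House"]}
```

Swapping in a different civilization then means defining one new template rather than rewriting every prompt.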
Lastly, Stable Diffusion does not create transparencies. I thought object detection would work, but surprisingly the detectors struggled to find a solid mask. Instead I forced a solid background color in the prompt and created a few transparent flood fills with ImageMagick.
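To show why a flood fill is a better fit than simply keying out a color, here is a pure-Python stand-in for the idea. It is not the ImageMagick command used for the project: the function names, the corner seeding, and the per-channel `tolerance` are assumptions. Only background pixels connected to a corner are cleared, so a background-colored detail enclosed by the sprite survives.

```python
from collections import deque

RGBA = tuple[int, int, int, int]
RGB = tuple[int, int, int]


def is_background(pixel: RGBA, background: RGB, tolerance: int) -> bool:
    """True if every color channel is within `tolerance` of the background."""
    return all(abs(c - b) <= tolerance for c, b in zip(pixel[:3], background))


def clear_background(
    pixels: list[list[RGBA]], background: RGB, tolerance: int = 0
) -> list[list[RGBA]]:
    """Return a copy with corner-connected background pixels made transparent.

    Expects a non-empty, rectangular grid indexed as pixels[y][x].
    """
    height, width = len(pixels), len(pixels[0])
    out = [list(row) for row in pixels]
    corners = {(0, 0), (width - 1, 0), (0, height - 1), (width - 1, height - 1)}
    # Seed the fill only from corners that actually show the background color.
    queue = deque(
        (x, y) for x, y in corners if is_background(pixels[y][x], background, tolerance)
    )
    seen = set(queue)
    while queue:
        x, y = queue.popleft()
        r, g, b, _ = pixels[y][x]
        out[y][x] = (r, g, b, 0)  # keep the color, zero the alpha
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (
                0 <= nx < width
                and 0 <= ny < height
                and (nx, ny) not in seen
                and is_background(pixels[ny][nx], background, tolerance)
            ):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return out
```

A small nonzero tolerance helps when the generated background is only approximately the requested color.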
Final Result
Other Experiments and Thoughts
Custom Buildings & Fine Tuning
It is also possible to reverse the process. Instead of creating different styles of existing sprites, new sprites in the original style could be made using a fine-tuned Age of Empires II model. Either 3D blockouts or photos of real buildings could work as inputs.
Greyboxing to Image
With one of my existing Unity projects, I took a greyboxed level and applied segmentation to the buildings, ground, and trees according to the ADE20k dataset.
With Control Net and object segmentation we can run it through the same Caspar David Friedrich prompt. This could be used in concept art or level design workflows to quickly block-out environment art.
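Segmentation guidance expects each region to be painted in the color of its class. A minimal sketch of that mapping step follows, with several assumptions: the greybox tags (`Building`, `Ground`, `Tree`) are hypothetical names, and the RGB values are the ones I believe correspond to the commonly distributed ADE20K palette, so verify them against the segmentation preprocessor you actually use.

```python
Color = tuple[int, int, int]

# Assumed ADE20K palette entries; check these against your own tooling.
ADE20K_COLORS: dict[str, Color] = {
    "building": (180, 120, 120),
    "tree": (4, 200, 3),
    "earth": (120, 120, 70),  # ADE20K groups "ground" under "earth"
}

# Hypothetical tags assigned to greybox objects in the level.
TAG_TO_CLASS: dict[str, str] = {
    "Building": "building",
    "Ground": "earth",
    "Tree": "tree",
}


def to_segmentation_map(tags: list[list[str]]) -> list[list[Color]]:
    """Convert a grid of greybox tags into a grid of class colors.

    Raises KeyError for a tag with no mapping, so gaps are caught early.
    """
    return [[ADE20K_COLORS[TAG_TO_CLASS[tag]] for tag in row] for row in tags]
```

In practice the tag grid would come from rendering the level with a flat, unlit color per object rather than from a hand-written list.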
Image Rich Mindset, Seeding, Spawning
With so much semantic and syntactic information about an image now available, each image in the project can be a starting point for synthesizing new outputs.
By defining 10-20 hero assets using traditional art workflows, be it characters, buildings, or environments, a development team could then kitbash new assets together. Studios with a large internal catalogue of concept art and assets might be interested in bringing this to fans to extend their lore and worlds.
Conclusion
These models are surprisingly versatile and a lot of fun to work with. Future games, should they choose to, could create a set of base assets that then seed user-generated lore or internal development. The game world is itself an image model: video games are both a place to be and a desirable frame for future images.
The project is online at @neilsonks and engine.study/sprite-diffuser/.