AI & ML interests

None defined yet.

Recent Activity

pcuenq posted an update about 19 hours ago
👉 What happened in AI in 2025? 👈

We prepared the 2025 version of the HF AI Timeline Grid, highlighting open vs API-based model releases, and allowing you to browse and filter by access, modality, and release type!

Play with it here:
2025-ai-timeline/2025-ai-timeline

Here's my personal quarterly TL;DR:

1๏ธโƒฃ Q1 โ€” Learning to Reason
Deepseek not only releases a top-notch reasoning model, but shows how to train them and compete with closed frontier models. OpenAI debuts Deep Research.

Significant milestones: DeepSeek R1 & R1-Zero, Qwen 2.5 VL, OpenAI Deep Research, Gemini 2.5 Pro (experimental)

2๏ธโƒฃ Q2 โ€” Multimodality and Coding
More LLMs embrace multimodality by default, and there's a surge in coding agents. Strong vision, audio, and generative models emerge.

Significant milestones: Llama 4, Qwen 3, Imagen 4, OpenAI Codex, Google Jules, Claude 4

3๏ธโƒฃ Q3 โ€” "Gold" rush, OpenAI opens up, the community goes bananas
Flagship models get gold in Math olympiads and hard benchmarks. OpenAI releases strong open source models and Google releases the much anticipated nano-banana for image generation and editing. Agentic workflows become commonplace.

Significant milestones: Gemini and OpenAI IMO Gold, gpt-oss, Gemini 2.5 Flash Image, Grok 4, Claude Sonnet 4.5

4๏ธโƒฃ Q4 โ€” Mistral returns, leaderboard hill-climbing
Mistral is back with updated model families. All labs release impressive models to wrap up the year!

Significant milestones: Claude Opus 4.5, DeepSeek Math V2, FLUX 2, GPT 5.1, Kimi K2 Thinking, Nano Banana Pro, GLM 4.7, Gemini 3, Mistral 3, MiniMax M2.1 🤯

Credits
๐Ÿ™ NHLOCAL for the source data https://github.com/NHLOCAL/AiTimeline

🫡 @reach-vb for the original idea, design, and recipe

🙌 @ariG23498 and yours truly for compiling and verifying the 2025 edition

🥳 Here's to 2026, wishing it becomes the best year ever for open releases and on-device-first use cases! 🥂
multimodalart posted an update 3 months ago
Want to iterate on a Hugging Face Space with an LLM?

Now you can easily convert an entire HF repo (Model, Dataset, or Space) to a single text file and feed it to a language model!

multimodalart/repo2txt
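
The Space handles this end to end, but if you want the gist of it in code, here's a minimal sketch of the same idea with huggingface_hub (the file filters and helper name are my own, not the Space's actual implementation):

```python
# Minimal sketch of the repo2txt idea (not the Space's actual code):
# list a repo's files and concatenate the text-like ones into one file.
from huggingface_hub import HfApi, hf_hub_download

def repo_to_txt(repo_id: str, repo_type: str = "model", out_path: str = "repo.txt"):
    api = HfApi()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in api.list_repo_files(repo_id, repo_type=repo_type):
            # Skip binary artifacts an LLM can't read anyway.
            if path.endswith((".safetensors", ".bin", ".ckpt", ".png", ".jpg")):
                continue
            local = hf_hub_download(repo_id, path, repo_type=repo_type)
            out.write(f"\n===== {path} =====\n")
            with open(local, encoding="utf-8", errors="replace") as f:
                out.write(f.read())

repo_to_txt("multimodalart/repo2txt", repo_type="space")
```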
multimodalart posted an update 7 months ago
Self-Forcing, a real-time video model distilled from Wan 2.1 by @adobe, is out, and they open sourced it!

I've built a live real-time demo on Spaces 📹💨

multimodalart/self-forcing
multimodalart posted an update over 1 year ago
The first open Stable Diffusion 3-like architecture model is JUST out 💣 - but it is not SD3! 🤔

It is Tencent-Hunyuan/HunyuanDiT by Tencent, a 1.5B-parameter DiT (diffusion transformer) text-to-image model 🖼️✨, trained with multilingual CLIP + multilingual T5 text encoders for English 🤝 Chinese understanding

Try it out yourself here ▶️ https://huggingface.co/spaces/multimodalart/HunyuanDiT
(a bit slow for now, as the model is chunky and the research code isn't yet optimized for inference speed)
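
If you'd prefer to run it locally, diffusers later added a pipeline for it; a minimal sketch, assuming a recent diffusers release and the Diffusers-format checkpoint (the checkpoint name below is my best guess at the published one):

```python
# Minimal local-inference sketch; assumes a recent diffusers release with
# HunyuanDiT support and the Diffusers-format checkpoint named below.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers", torch_dtype=torch.float16
).to("cuda")

# The multilingual text encoders mean Chinese prompts work as well as English.
image = pipe(prompt="a fishing village at dawn, ink wash painting style").images[0]
image.save("hunyuandit.png")
```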

In the paper they claim to be SOTA among open-source models, based on human preference evaluation!
pcuenq posted an update over 1 year ago
OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes varying between 270M and 3B parameters. Their quality, according to benchmarks, is similar to OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.
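
For intuition on layer-wise scaling: instead of giving every transformer block the same width, per-layer parameters like the number of attention heads are interpolated from a small value in early layers to a larger one in deep layers. A toy sketch of that rule (the linear interpolation and the numbers are illustrative, not OpenELM's exact hyperparameters):

```python
# Toy illustration of layer-wise scaling: per-layer attention heads grow
# with depth instead of staying constant. Numbers are illustrative only,
# not OpenELM's actual hyperparameters.
def layerwise_heads(num_layers: int, min_heads: int, max_heads: int) -> list[int]:
    return [
        round(min_heads + (max_heads - min_heads) * i / (num_layers - 1))
        for i in range(num_layers)
    ]

print(layerwise_heads(num_layers=16, min_heads=4, max_heads=16))
# Early layers stay narrow; deeper layers get more heads (and, in OpenELM,
# wider FFNs), reallocating parameters where they help most.
```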

I converted these models to Core ML, for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models were uploaded to this community on the Hub, for anyone who wants to integrate them in their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.
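
For reference, the core of such a conversion with coremltools looks roughly like this; a condensed sketch of the settings above, not the full gist (which also handles model-specific details):

```python
# Condensed sketch of the conversion settings above using coremltools;
# see the gist linked above for the complete, model-specific script.
import coremltools as ct
import numpy as np
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M", trust_remote_code=True, return_dict=False
)
model.eval()

seq_len = 128  # fixed sequence length, per the settings above
example_input = torch.zeros((1, seq_len), dtype=torch.int64)
traced = torch.jit.trace(model, example_input)

mlmodel = ct.convert(
    traced,
    convert_to="mlprogram",
    inputs=[ct.TensorType(name="input_ids", shape=(1, seq_len), dtype=np.int32)],
    compute_precision=ct.precision.FLOAT32,  # float32, per the settings above
)
mlmodel.save("OpenELM-270M.mlpackage")
```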

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M model on my M1 Max, and about 6.5 tok/s with the largest 3B model. These speeds could be improved by converting to float16; however, there's some precision loss somewhere and generation doesn't work in float16 mode yet. I'm looking into it and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95

I'm also looking at optimizing inference with an experimental KV cache in swift-transformers. It's a bit tricky because the layers have varying numbers of attention heads, but I'm curious to see how much this feature can speed up generation for this model family :)

Regarding the instruct fine-tuned models, I don't know the chat template that was used. The models use the Llama 2 tokenizer, but neither the Llama 2 chat template nor the default Alignment Handbook one appears to be the one used for training. Any ideas on this are welcome!
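
If anyone wants to poke at this, one approach is to set a candidate template on the tokenizer and eyeball the generations; a minimal sketch (the Llama 2-style template below is just one guess, and the tokenizer repo is the gated Llama 2 one the models reuse):

```python
# Probe a candidate chat template; the template below is only a guess
# (Llama 2's [INST] format), not a confirmed match for OpenELM-Instruct.
from transformers import AutoTokenizer

# The post notes the models reuse the Llama 2 tokenizer (gated repo).
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

tok.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}<s>[INST] {{ message['content'] }} [/INST]"
    "{% else %} {{ message['content'] }}</s>{% endif %}"
    "{% endfor %}"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about the M1 Max."}],
    tokenize=False,
)
print(prompt)  # feed this to the model and check whether outputs look coherent
```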
multimodalart posted an update almost 2 years ago
The Stable Diffusion 3 research paper broken down, including some overlooked details! 📝

Model
๐Ÿ“ 2 base model variants mentioned: 2B and 8B sizes

๐Ÿ“ New architecture in all abstraction levels:
- ๐Ÿ”ฝ UNet; โฌ†๏ธ Multimodal Diffusion Transformer, bye cross attention ๐Ÿ‘‹
- ๐Ÿ†• Rectified flows for the diffusion process
- ๐Ÿงฉ Still a Latent Diffusion Model

📄 3 text encoders: 2 CLIPs, one T5-XXL; plug-and-play: removing the larger one maintains competitiveness

๐Ÿ—ƒ๏ธ Dataset was deduplicated with SSCD which helped with memorization (no more details about the dataset tho)

Variants
๐Ÿ” A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics
โœ๏ธ An Instruct Edit 2B model was trained, and learned how to do text-replacement

Results
✅ State of the art in automated evals for composition and prompt understanding
✅ Best win rate in human preference evaluation for prompt understanding, aesthetics, and typography (some details missing on the number of participants and the design of the experiment)

Paper: https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf
multimodalart posted an update almost 2 years ago
It seems February started with a fully open source AI renaissance 🌟

Models released with fully open datasets, training code, and weights ✅

LLM - allenai/olmo-suite-65aeaae8fe5b6b2122b46778 🧠
Embedding - nomic-ai/nomic-embed-text-v1 📚 (SOTA! usage sketch below)
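
A minimal usage sketch for the embedding model via sentence-transformers, following the model card's documented task prefixes (remote code is required to load it):

```python
# Minimal usage sketch for nomic-embed-text-v1 with sentence-transformers;
# the task prefixes and trust_remote_code follow the model card.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = ["search_document: Models released with open data, code, and weights."]
query = ["search_query: fully open source LLMs"]

doc_emb = model.encode(docs)
query_emb = model.encode(query)
print(util.cos_sim(query_emb, doc_emb))
```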

And it's literally February 1st - can't wait to see what else the community will bring 👀