cantSpellSober 4 days ago

> outperforms Pixtral-12B and Llama3.2-11B

Cool, maybe needs a better name for SEO though. ARIA already has a meaning in web apps.

  • panarchy 4 days ago

    They could call it Maria (MoE Aria), though that won't help it stand out in searches. Maybe MarAIa, so it would be more unique.

    I'm here all night if anyone else needs some other lazy name suggestions.

theanonymousone 4 days ago

In an MoE model such as this, are all "parts" loaded in memory at the same time, or is only one part loaded at any given time? For example, does Mixtral-8x7B have the memory requirement of a 7B model, or of a 56B model?

  • 0tfoaij 4 days ago

    MoEs still require the total number of parameters (46B, not 56B; there's some overlap) to be in RAM/VRAM, but the benefit is that inference speed is determined by the number of active parameters, which in Mixtral's case is 2 experts at 7B each, giving inference speed comparable to a 14B dense model. That 3x improvement in inference speed would be worth the additional RAM usage alone, especially for CPU inference, where memory bandwidth rather than total memory capacity is the limiting factor.

    As a bonus, there's a general rule of thumb for how MoEs compare to dense models: take the square root of (active parameters * total parameters), which puts Mixtral roughly on par with a 25B dense model (a quick sketch of the arithmetic is at the end of this comment). In ARIA's case that means the memory usage of a 25B model, the performance of a ~10B model, and the speed of a 4B model. That's a nice trade-off if you can spare the additional RAM.

    If it helps, MoEs aren't disparate "expert" models, each trained on specific domain knowledge and jammed into a bigger model; rather, they're copies of the same base model trained together, where each copy ends up specialising on individual tokens. As the image dartos linked shows, you can end up with some "experts" in the model that really, really like placing punctuation or language syntax, for whatever reason.
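
    Here's that rule of thumb spelled out, using the rough parameter counts above (it's a community heuristic, not something from the ARIA paper, so treat the numbers as ballpark):

      import math

      # Heuristic: an MoE behaves roughly like a dense model with
      # sqrt(active_params * total_params) parameters.
      def dense_equivalent_b(active_b, total_b):
          return math.sqrt(active_b * total_b)

      print(dense_equivalent_b(14, 46))  # Mixtral-8x7B: ~25 -> compares to a ~25B dense model
      print(dense_equivalent_b(4, 25))   # ARIA: ~10 -> performs like a ~10B dense model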

  • dartos 4 days ago

    Closer to 56.

    All parts are loaded, as any of them could be called upon to generate the next token.

    • theanonymousone 4 days ago

      Aha, so it's decided per token, not per input. I thought at first the LLM chooses a "submodel" based on the input and then follows it to generate the whole output.

      Thanks a lot.

      • dartos 4 days ago

        Yeah, this image helped solidify that for me.

        https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

        Each different color highlight is generated by a different expert.

        You can see that the "experts" are more experts of syntax than of concepts. Notice how the light blue one almost always generates punctuation and operators (until the later layers, when the red one does so).

        I'm honestly not too sure of the exact mechanism behind which expert gets chosen. I'm sure it's encoded in the weights somehow, but I haven't gone too deep into MoE models.
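
        My rough understanding of the usual setup (standard Mixtral-style top-k gating; illustrative code, not ARIA's actual implementation) is that each MoE layer has a small learned "router" that scores every expert for each token and keeps the top-k:

          import numpy as np

          def moe_layer(x, router_w, experts, k=2):
              # x: (d_model,) hidden state for one token
              # router_w: (n_experts, d_model) learned router weights
              # experts: list of per-expert feed-forward callables
              logits = router_w @ x                # one score per expert for this token
              topk = np.argsort(logits)[-k:]       # keep only the k best-scoring experts
              w = np.exp(logits[topk] - logits[topk].max())
              w /= w.sum()                         # softmax over the selected experts only
              # only the chosen experts run, which is why active params << total params
              return sum(wi * experts[i](x) for wi, i in zip(w, topk))

        The router is trained jointly with the experts, so the per-token specialisation (punctuation, whitespace, etc.) emerges from training rather than being designed in.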

        • MacsHeadroom 4 days ago

          I see a whitespace expert, a punctuation expert, and a first word expert. It's interesting to see how the experts specialize.

          • dartos 4 days ago

            Right?

            Then you get some strange ones where parts of whole words are generated by different experts.

            Makes me think that there’s room for improvement in the expert selection machinery, but I don’t know enough about it to speculate.

            • zo1 a day ago

              Honestly, looks like someone throwing spaghetti on a wall a billion times and seeing what sticks, then training the throwing arm to somehow minimize something. I get that LLM magic is kinda magic and is doing some cool stuff, but this looks like it's just chaos and statistical untangling that happens to minimize some random fitness function X-levels down the line.

niutech 4 days ago

I’m curious how it compares with recently announced Molmo: https://molmo.org/

  • espadrine 4 days ago

    The Pixtral report [0] compares favourably against Molmo.

    (Also, beware: molmo.org is an AI-generated website set up to siphon off Allen AI's work through SEO; the real website is molmo.allenai.org. Note, for instance, that all the tweets listed here are from fake accounts that have since been suspended: https://molmo.org/#how-to-use)

    [0]: https://arxiv.org/pdf/2410.07073

  • bsenftner 4 days ago

    Know where Molmo is being discussed? Looks interesting.

petemir 4 days ago

The model should be available for testing here [0], although I tried to upload a video and got an error in Chinese, and whenever I write something it says the API key is invalid or missing.

[0] https://rhymes.ai/

vessenes 5 days ago

This looks worth a try. Great test results, very good example output. No way to know if it's cherry-picked / overtuned without giving it a spin, but it will go on my list. Should fit on an M2 Max at full precision.

  • SubiculumCode 5 days ago

    How do you figure out the required memory? The MoE aspect complicates it.

    • vessenes 4 days ago

      It does; in this case, though, a 25B fp16 model will fit. The paper mentions an A100 80G is sufficient but a 40 is not; an M2 Max has up to 96G. That said, MoEs are popular on lower-memory devices because you can swap out the expert layers -- the expert layers are like 3-4B parameters each, so if you're willing to accept a pause in generation while you load up the desired expert, you could do it in a lot less RAM. They pitch the main benefit here as faster generation: it's a lot less matmul to do per token generated.
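
      The weights-only arithmetic is straightforward if it helps (back-of-the-envelope only: this ignores the KV cache and activation overhead, and uses the ~25B total quoted above):

        # weights only; bytes per parameter: 2 for fp16/bf16, 1 for int8, ~0.5 for 4-bit quants
        def weight_gb(params_billions, bytes_per_param=2):
            return params_billions * bytes_per_param

        print(weight_gb(25))       # ~50 GB: too big for a 40GB A100, fits on an 80GB one
        print(weight_gb(25, 0.5))  # ~12.5 GB with a 4-bit quant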

  • Onavo 5 days ago

    What's the size of your M2 Max memory?

    • treefry 4 days ago

      Looks like 64GB or more

SomewhatLikely 4 days ago

"Here, we provide a quantifiable definition: A multimodal native model refers to a single model with strong understanding capabilities across multiple input modalities (e.g. text, code, image, video), that matches or exceeds the modality specialized models of similar capacities."