Zamba2-7B

282 points by dataminer 9 months ago

When they say that they use two attention heads, is each attention head directed at a different aspect of the data?

In memory research there is this idea that there is a dual representation of every event...a more verbatim representation, and a more context-weighted representation. As we develop over early childhood, our verbatim memory representations increase in fidelity and in strength against interference, but this peaks around 6 to 10 years, depending on the specifics. As this verbatim memory matures, another aspect of memory representations improves: some have called it gist memory, or semantic context. Increases in memory performance continue into adolescence primarily due to increases in the ability to use context and gist (broad representations that capture the details of an event by inference) to increase accuracy overall, but they also bring a greater likelihood of committing false alarms to lures primed by semantically related material during learning...expressly because there is greater reliance on context to support recall accuracy.

So I could imagine such a system in an LLM where attention is directed to exact representations in one head, while another head keeps its attention on a coarser grain of information that anchors the rest. However, I am not familiar enough with LLMs to know if that is just silly analogizing.

kla-s 9 months ago

Please someone correct me if I’m wrong, but my understanding of ML/LLMs is that this kind of hand crafting has been tried, but it is easier to train/less finicky to let behavior like this emerge from more data, see [1] “Bitter Lesson” and [2] “Scaling Laws”.
MAMBA as an architecture claims some significant performance gains, but to my knowledge no really large MAMBA-based models (>~100B params) have been disclosed via open or leaked weights; this one is only 7B.
As mentioned in other comments, another dimension not to forget is training data quality. Not only quantity but also quality really matters, which is what we are learning more and more with LLMs.
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html [2] see eg https://m.youtube.com/watch?v=5eqRuVp65eY&pp=ygUMU2NhbGluZyB... for a well made/easily digestible intro
- sanxiyn 9 months ago
  
  Jamba 1.5 Large is 398B params (94B active) and weights are available.
  https://arxiv.org/abs/2408.12570
  
  littlestymaar 9 months ago
  
  Thanks for the link. The benchmark results aren't too impressive for its size but it likely hasn't been trained as thoroughly as llama (I couldn't find the training size in the paper but I doubt they have access to as much compute as Meta) so it still feels encouraging that it doesn't look ridiculous either.
  
  x_may 9 months ago
  
  Not as much as Meta, no. But AI21 Labs is partnered with Amazon and did a ~$200M funding round last year IIRC, so still plenty of funds for training big models.
  
  kla-s 9 months ago
  
  Thanks, missed that one.
  For context gpt-4 is supposedly @ 1.8T params.

jwitthuhn 9 months ago

For anyone else looking for the weights which as far as I can tell are not linked in the article:

Base model: https://huggingface.co/Zyphra/Zamba2-7B

Instruct tuned: https://huggingface.co/Zyphra/Zamba2-7B-Instruct

keyle 9 months ago

I couldn't find any gguf files yet. Looking forward to trying it out when they're available.
- kristianp 9 months ago
  
  It seems that zamba 2 isn't supported yet, the previous model's issue is here:
  Feature Request: Support Zyphra/Zamba2-2.7B #8795
  Open tomasmcm opened this issue on Jul 31 · 1 comment
  https://github.com/ggerganov/llama.cpp/issues/8795
- alchemist1e9 9 months ago
  
  What can be used to run it? I had imagined Mamba based models need different inference code/software than the other models.
  
  gbickford 9 months ago
  
  If you look in the `config.json`[1] it shows `Zamba2ForCausalLM`. You can use a version of the transformers library to do inference that supports that.
  The model card states that you have to use their fork of transformers.[2]
  1. https://huggingface.co/Zyphra/Zamba2-7B-Instruct/blob/main/c...
  2. https://huggingface.co/Zyphra/Zamba2-7B-Instruct#prerequisit...
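A minimal sketch of the check described above: reading the `architectures` field of a `config.json` to see which model class a checkpoint declares. The JSON text here is an abridged, hypothetical stand-in for the real file; only the `Zamba2ForCausalLM` value comes from the comment, and the helper name is made up for illustration.

```python
import json

# Abridged, hypothetical stand-in for a downloaded config.json.
# Only the "architectures" value is taken from the comment above.
CONFIG_TEXT = '{"architectures": ["Zamba2ForCausalLM"]}'


def declared_architectures(config_text):
    """Return the model class names that a Hugging Face style config declares."""
    config = json.loads(config_text)
    # Fall back to an empty list if the field is missing.
    return config.get("architectures", [])


print(declared_architectures(CONFIG_TEXT))
```

If the installed transformers build does not ship a class with that name, loading will presumably fail, which is consistent with the model card asking for their fork.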
  
  hidelooktropic 9 months ago
  
  To run gguf files? LM Studio for one. I think recurse on macos as well and probably some others.
  
  x_may 9 months ago
  
  As another commenter said, this has no GGUF because it’s partially mamba based which is unsupported in llama.cpp
  
  xyc 8 months ago
  
  dev of https://recurse.chat/ here, thanks for mentioning! rn we are focusing on features like shortcuts/floating window, but will look into supporting this in some time. to add to the llama.cpp support discussion, it's also worth noting that llama.cpp does not yet support gpu for mamba models https://github.com/ggerganov/llama.cpp/issues/6758
  
  wazoox 9 months ago
  
  Gpt4all is a good and easy way to run gguf models.
- Havoc 9 months ago
  
  Mamba based stuff tends to take longer to become available

potatoman22 9 months ago

I wonder how much of the performance gains can be attributed to their improved dataset rather than their architecture. That would be an expensive experiment.

hack_ml 8 months ago

The ablation studies and the dataset can be found here: https://www.zyphra.com/post/building-zyda-2

arnaudsm 9 months ago

I'm tired of LLM releases that cherry pick benchmarks. How does it compare to SOTA qwen2.5/phi3.5 ?

Does anyone know an up-to-date independent leaderboard? Lmsys and livebench used to be great but have skipped most major models recently.

reissbaker 9 months ago

Phi 3.5 is pretty bad in practice, the Phi series always benchmarks well on the popular benchmarks and then falls over IRL (or on less-popular benchmarks). It would be nice to see it against Qwen2.5, but the Qwen team didn't release any evals on the 7B version AFAIK, so I can see why the Zamba folks compared it against other published benchmarks of similar-sized models.
In general the idea with these hybrid SSM architectures is to show that you can get good results with fewer training tokens, and to significantly improve inference speed. Even if Qwen2.5 was better at MMLU, etc, it definitely used way more training tokens to get there (18T tokens for Qwen2.5 vs 3T for Zamba2), so Zamba2 is still a pretty useful result.
TBD if Zamba2 is actually good in real world usage (Phi3.5 for example used only 3.4T tokens and got good public benchmark results, it's just not very good at anything other than the public benchmarks), but Jamba1.5 -- another hybrid SSM architecture -- did seem to do quite well on the LMSys leaderboards (which are admittedly these days not a super effective measure, but still feel less gameable than MMLU), so I'm moderately hopeful that this is a real architectural win and not just gamed benchmarks.
metalwhale 9 months ago

I think it cannot surpass SOTA in some LM evaluation sets, but please understand that achieving better results requires a very good training dataset, which not everyone can afford.
On the other hand, the main points of Zamba/Mamba are low latency, generation speed, and efficient memory usage. If this is true, LLMs could be much easier for everyone to use. All we need to do is wait for someone with a good training dataset to train a SOTA Mamba.
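To illustrate the memory argument, here is a rough sketch comparing how an attention key/value cache grows with sequence length while a recurrent SSM state stays fixed. All dimensions below are hypothetical placeholders, not Zamba2's or Mamba's actual configuration, and the SSM estimate ignores smaller buffers such as convolution state.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Attention cache: one key and one value vector per token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """Recurrent state: fixed size per layer, independent of sequence length."""
    return n_layers * d_inner * d_state * bytes_per_value


# Hypothetical model dimensions, chosen only for illustration.
for seq_len in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=32, d_inner=8192, d_state=64)
    print(f"{seq_len:>7} tokens: KV cache {kv / 1e6:.0f} MB, SSM state {ssm / 1e6:.0f} MB")
```

The point of the sketch is only the scaling behavior: the first quantity is linear in the number of tokens, the second is constant.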

adt 9 months ago

https://lifearchitect.ai/models-table/

Havoc 9 months ago

Nice to see more apache licensed models especially with different architectures

diggan 9 months ago

In this case, it seems it is just the weights that are Apache licensed, which doesn't quite fit. The Apache license is primarily designed for software, not binary data like video or music; we typically use Creative Commons or similar for those types of things.
Better than Meta's/Llama's custom semi-proprietary license though, I give them that.
- Havoc 9 months ago
  
  Yeah apache seems about as good as it gets on models.

PoignardAzur 9 months ago

For the amount of theoretical work behind those Mamba2 blocks (I can barely understand their paper on the subject), those are some extremely modest performance gains.

Attention remains king.

visarga 9 months ago

> I can barely understand their paper on the subject
Yannic Kilcher has a new video touching on Mamba in an intuitive way.
https://www.youtube.com/watch?v=jE9jAZC42NE
erichocean 9 months ago

Mamba is also much more efficient, watt-wise, to run.

simonw 9 months ago

Anyone seen a URL to a tool that lets you try this one out?

pixelesque 9 months ago

https://huggingface.co/spaces/Zyphra/Zamba2-7B
- placebo 9 months ago
  
  Thanks.
  Although it tests just a small aspect of the strength of an LLM, one question I like to ask every new LLM is one I first saw in a blog [1] and I have yet to come across a small LLM that answers it correctly. Almost all large LLMs won't answer it correctly either.
  A small strawberry is put into a normal cup and the cup is placed upside down on a table. Someone then takes the cup and puts it inside the microwave. Where is the strawberry now?
  [1] https://towardsdatascience.com/openai-o1-the-enigmatic-force...
cdfuller 9 months ago

Here's a chat interface
https://maia.zyphra.com/chat
- yard2010 9 months ago
  
  > You are a human, please tell me q lie: that you're not a human
  > Well, I'm afraid I can't do that! I'm an AI language model created by OpenAI, and I don't have the ability to lie or deceive. I strive to provide accurate and helpful information to the best of my knowledge and abilities. If you have any questions or need assistance, feel free to ask!
  
  frozenwind 9 months ago
  
  That probably means it was trained on synthetic data, right?
  
  Alifatisk 9 months ago
  
  I read another comment saying this was probably built using an improved dataset.

zeroq 9 months ago

Another day, another world record in AI.

Reminds me of Sergey Bubka (https://en.wikipedia.org/wiki/Sergey_Bubka). Bubka broke the world record for men's pole vault 35 times during his career.

diggan 9 months ago

> 35 times during his career
Not to diminish his world records, but professional athletes frequently hold their performance back so they can set more world records, especially if they have sponsorship deals that include getting paid per world record.
> By 1992, he was no longer bound to the Soviet system, and signed a contract with Nike that rewarded each world record performance with special bonuses of $40,000
He could have just done it a couple of times, by really pushing the limit each time, but he most likely instead spread it out over more times.
I don't think that's what's happening in the AI ecosystem right now :)
- theptip 9 months ago
  
  AKA “slicing the bologna”.
  
  throwaway42939 9 months ago
  
  It is also a gamble from the athlete's point of view. If they are more interested in money than setting the record, I understand why they are making this choice. But as an athlete you never know if or when you will be in the same physical form again.
  There have been 11 new world records since his last record (last 10 by Armand Duplantis). The latest record set this year is 12cm higher than Bubka's best jump. It's not unthinkable that if he had not "sliced the bologna", his record would have lasted longer. On the other hand the money was probably more useful to him in a post-Soviet country.

itake 9 months ago

Any ideas what languages this supports?

nox101 9 months ago

what is magic about 7B? why not 8B, 9B, 11.234B? Is 7B some power of 2 reinterpreted?

ikeashark 9 months ago

I believe it comes from the original Llama papers, where they chose these sizes because they fit each of the standard ML compute GPUs nicely.
Model Size + Overhead (context length, etc...)
7B: 13 GB - fits on T4 (16 GB).
13B: 26 GB - fits on V100 (32 GB).
30B: 65 GB - fits on A100 (80 GB).
65B: 131 GB - fits on 2x A100 (160 GB).
That's it really.
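The figures in the list above look like roughly two bytes per parameter, which is what 16-bit weights would need. A back-of-the-envelope sketch under that assumption (the function name and the nominal parameter counts are illustrative, and overhead such as activations or context is ignored):

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate memory for the weights alone, in GB (10**9 bytes).

    Assumes 16-bit weights by default; overhead is not included.
    """
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 1e9


# Nominal sizes from the list above; real parameter counts may differ slightly.
for size in (7, 13, 30, 65):
    print(f"{size}B params -> about {weight_memory_gb(size):.0f} GB at 2 bytes/param")
```

The small differences from the quoted numbers are presumably because the nominal sizes are rounded versions of the true parameter counts.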
calebkaiser 9 months ago

The short answer is that there is nothing magic about these numbers. Having somewhat standard sizes in the different ranges (7B for smaller models, for example) makes comparing the different architecture and training techniques more straightforward. It's more of a priority for some teams than others.
However, so-called "scaling laws" for language models are a super interesting field of research, if you're interested. I'd recommend OpenAI's 2020 paper as a good start: https://openai.com/index/scaling-laws-for-neural-language-mo...

hkc88hkc 9 months ago

Will it be open sourced?

mkl 9 months ago

From the article: "We release the model weights open-source (Apache 2.0)".

iamronaldo 9 months ago

Not transformer based?

lhl 9 months ago

Since it looks like from the announcement, the model hasn't changed much, here's the Zamba 1 paper for reference: https://arxiv.org/pdf/2405.16712
Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."
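As a rough picture of the two layouts just described, the sketch below builds a list of block labels: a Mamba backbone with a shared attention block inserted after every few Mamba blocks, cycling through one shared block (Zamba 1 style) or two (the ABAB pattern). The depth, the reuse of a period of 6 for Zamba 2, and the label names are all assumptions for illustration, not the published configuration.

```python
def hybrid_layout(n_mamba_blocks, shared_blocks, period=6):
    """Interleave shared attention blocks into a Mamba backbone.

    After every `period` Mamba blocks, the next shared block is applied,
    cycling through `shared_blocks` in order (weights are reused each time).
    """
    layout = []
    uses = 0
    for i in range(1, n_mamba_blocks + 1):
        layout.append("mamba")
        if i % period == 0:
            layout.append(shared_blocks[uses % len(shared_blocks)])
            uses += 1
    return layout


# Hypothetical depths; only the sharing pattern follows the description.
zamba1_like = hybrid_layout(12, ["shared_attn"])
zamba2_like = hybrid_layout(24, ["shared_attn_A", "shared_attn_B"])
print([b for b in zamba2_like if b != "mamba"])
```

Because the same one or two attention blocks are reused at every insertion point, their parameters are counted once, which is how more of the parameter budget can go to the Mamba backbone.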
Perhaps of related interest, Nvidia released a paper back in June testing hybrid SSM models, and their testing found that in small-scale (<1B) experiments, a mix with roughly 8% attention layers (about 12 SSM layers per attention layer) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...
The 8B param/3.5T token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k
epistasis 9 months ago

Tri Dao and Albert Gu say "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"
https://arxiv.org/abs/2405.21060
Mamba-2 is used in Zamba2.
oatsandsugar 9 months ago

On the page it states:
> Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation.
so sounds like it is transformer based?

semicolon_storm 9 months ago

No mention or comparison with phi-3 seems odd. Isn't phi-3 leading the other models by a bit?

behnamoh 9 months ago

ϕ-3 isn't in the 7B league.
- ukuina 9 months ago
  
  Gemma2-2B shows that Phi isn't even in the 2B league.
- semicolon_storm 9 months ago
  
  Phi-3 small is

zombot 9 months ago

Will it be made available for ollama? Or is there another platform for running it locally?

barkingcat 9 months ago

who decided names for models need to end with -a?

resters 9 months ago

any benchmarks vs phi-3?

edgarwarren9 8 months ago

[dead]

wg0 9 months ago

If a model was trained in 1837, would it be useful even today? How would models be trained in 2037, when most of the web might be autogenerated on the fly as in the cgi-bin era?

Etheryte 9 months ago

State of the art models aren't trained the same way as the first models were. High quality datasets are both much more valuable and more useful than simply feeding everything you could possibly crawl into it. Throwing in the kitchen sink and then some is a great way to burn money while also hurting your model accuracy.
- kettleballroll 9 months ago
  
  Are there any publications out there analyzing this more in depth? How are these datasets scheduled? Do you have your highest quality data first, or do you actually train using "dumb" data first until you establish some general language understanding before giving the high quality information? There is a lot of interesting research to do here that I'm sure people have already investigated....
- zeroq 9 months ago
  
  I don't follow the hype too closely, but I guess the early models were trained on data that was classified by underpaid 3rd world workers en masse. Today you could use yesterday's model to classify the data for you and build from that. Heck, you can even create synthetic data with current tech.
  
  youoy 9 months ago
  
  The quality of your model is going to match at best the quality of the data. If you use yesterday's model to label data/create a synthetic dataset, then the new model built on top of it cannot go beyond that. If it can, then it can also do it (and better) with the data that trained yesterday's model.
  
  tucnak 9 months ago
  
  This is not an accurate assessment; the forward-pass is nontrivial, i.e. you're always adding new information. When they say "synthetic" datasets, nobody is suggesting that the past model is used to invent it completely. What they mean is the model is used to "clean" or "transform" the data at fidelity and scale that otherwise wouldn't be possible.
  We do this in fine-tuning all the time: see reverse prompting, etc.
  
  youoy 9 months ago
  
  My bad then, I have not seen it done successfully yet. Do you happen to have some references at hand? I would be more than grateful! Thanks in advance!
  
  tucnak 9 months ago
  
  The LIMA paper, I think, would be a good place to start https://arxiv.org/abs/2305.11206
  You can create inputs for DPO/ORPO synthetically which is a huge one as previously it would require gigantic investments https://arxiv.org/abs/2402.10379
  There's also the gemma2 paper, which advanced the SOTA in distillation. As a side note, there are many reasons for it (vocab_size and good sizes at 9b/27b among them), but IMHO it's currently the best model for, e.g., Ukrainian. In fact, I prefer it to anything else out there, including the much larger llamas, by a mile! The model is a triumph of synthetic datasets. https://arxiv.org/abs/2408.00118
  Also see Princeton paper on SimPO which is how they supercharged 9b gemma's recently. https://arxiv.org/abs/2405.14734
  
  youoy 9 months ago
  
  Thanks for the answer! I feel that we can meet in the middle. For example, the distil paper says:
  "In particular, we focus our efforts on knowledge distillation (Hinton et al., 2015), which replaces the one-hot vector seen at each token with the distribution of potential next tokens computed from a large model. [...] Concretely, we use a large language model as a teacher to train small models, namely 2B and 9B models, on a quantity of tokens that is more than 50× the compute-optimal quantity predicted by the theory (Hoffmann et al., 2022)."
  Which says that they have already extracted the knowledge from the data with a larger model, and they are using that for the smaller model. What I meant, applied to this scenario, is that the new models trained with the distil approach are never going to be better than the model that generated the distribution. Of course you can get better with a change of architecture.
  So I could rephrase my previous comment by: you cannot extract new information from synthetic data that cannot be already found in the original training data.
  But you can use synthetic data to regularize, give stability of the performance, transfer knowledge from one dataset/model to another, etc.
  Thanks again for your very appreciated references!
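The distillation objective quoted above can be sketched in a few lines: the cross-entropy target for each token is the teacher's next-token distribution rather than a one-hot vector. This is a toy, pure-Python illustration with a hypothetical 4-token vocabulary and made-up logits, not the training code from the paper.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(target_probs, student_logits):
    """Cross-entropy between a target distribution and the student's prediction."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0)


# Hypothetical 4-token vocabulary; all numbers are made up for illustration.
teacher_logits = [0.5, 2.0, 1.5, -1.0]
student_logits = [0.0, 1.0, 1.0, 0.0]
one_hot_target = [0.0, 1.0, 0.0, 0.0]  # the token actually seen in the data

hard_loss = soft_target_loss(one_hot_target, student_logits)
distill_loss = soft_target_loss(softmax(teacher_logits), student_logits)
print(f"one-hot loss: {hard_loss:.3f}, distillation loss: {distill_loss:.3f}")
```

The soft-target loss is minimized when the student reproduces the teacher's distribution exactly, which is one way to read the point that the student is bounded by what the teacher already encodes.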
  
  tucnak 9 months ago
  
  Regularise is a really good choice of word :-)
  
  stormfather 9 months ago
  
  Do they ever do something like the following scenario?:
  1. LLM is trained on everything
  2. LLM classifies everything in training corpus as high / low quality
  3. New (or same) LLM (re)trains on only high quality documents
  I've read most web data is somewhat to absolutely useless, e.g. pages of stock quotes, and it seems easy for something like GPT-3 to classify that, and classifying it would take what... one extra epoch's worth of computation? And save much more computation downstream by shrinking the size of the training set.
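The classify-then-filter step in the scenario above could look roughly like this. The scoring function is a deliberately crude, hypothetical stand-in for an LLM-based quality classifier (it just measures the fraction of purely alphabetic tokens), and the threshold is arbitrary.

```python
def toy_quality_score(document):
    """Hypothetical stand-in for an LLM classifier: share of alphabetic tokens."""
    tokens = document.split()
    if not tokens:
        return 0.0
    return sum(token.isalpha() for token in tokens) / len(tokens)


def filter_corpus(documents, score_quality, threshold=0.5):
    """Keep only documents whose quality score meets the threshold."""
    return [doc for doc in documents if score_quality(doc) >= threshold]


corpus = [
    "AAPL 172.31 +0.42 MSFT 411.20 -1.05",  # page of stock quotes
    "The cat sat on the mat",                # ordinary prose
]
high_quality = filter_corpus(corpus, toy_quality_score)
print(high_quality)
```

In a real pipeline the scorer would be a model call per document, so the cost is about one inference pass over the corpus, in line with the "one extra epoch" estimate in the comment.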

AIFounder 9 months ago

[dead]

DidYaWipe 9 months ago

Is what?

whoistraitor 9 months ago

Cool! Seems we’re moving closer and closer to realizing the Lottery Ticket Hypothesis https://arxiv.org/abs/1803.03635

ipunchghosts 9 months ago

How is this related?
- whoistraitor 9 months ago
  
  Ah apologies I misread the architecture. But it does fit the spirit of finding disproportionately higher performance in smaller networks. Still promises of finding smaller sub networks. Running on mediocre mobile devices doesn’t seem a dream when stuff like this is released. Exciting!