semicolon_storm 7 minutes ago

No mention of or comparison with phi-3 seems odd. Isn't phi-3 leading the other models by a bit?

potatoman22 an hour ago

I wonder how much of the performance gains can be attributed to their improved dataset rather than their architecture. That would be an expensive experiment.

itake 11 minutes ago

Any ideas what languages this supports?

whoistraitor 36 minutes ago

Cool! Seems we’re moving closer and closer to realizing the Lottery Ticket Hypothesis https://arxiv.org/abs/1803.03635

  • ipunchghosts 35 minutes ago

    How is this related?

    • whoistraitor 25 minutes ago

      Ah, apologies, I misread the architecture. But it does fit the spirit of finding disproportionately higher performance in smaller networks, and it still holds out the promise of finding smaller subnetworks. Running models like this on mediocre mobile devices no longer seems like a dream. Exciting!

simonw 27 minutes ago

Anyone seen a URL to a tool that lets you try this one out?

SubiculumCode an hour ago

When they say that they use two attention heads, is each attention head directed at different aspects of the data?

In memory research there is the idea that every event has a dual representation: a more verbatim representation and a more context-weighted one. Over early childhood, our verbatim memory representations increase in fidelity and in resistance to interference, peaking around ages 6 to 10, depending on the specifics. As verbatim memory matures, another aspect of memory improves: what some have called gist memory, or semantic context. Memory performance continues to improve into adolescence, primarily through a growing ability to use context and gist (broad representations that capture the details of an event by inference) to increase overall accuracy, but also with a greater likelihood of false alarms to lures primed by semantically related material during learning, precisely because recall comes to rely more heavily on context.

So I could imagine such a system in an LLM, where one head directs attention to exact representations while another keeps its attention on a coarser grain of information that anchors it. However, I am not familiar enough with LLMs to know whether that is just silly analogizing.
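
Concretely, I'm picturing something like this: one causal head limited to a short local window (the "verbatim" detail) and one attending over the whole causal context (the "gist"). A toy sketch of my own, not anything from the Zamba 2 paper; the window size and dimensions are made up:

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def attend(q, k, v, mask):
        # standard scaled dot-product attention with a boolean mask
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores = np.where(mask, scores, -1e9)
        return softmax(scores) @ v

    T, d = 8, 16
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

    causal = np.tril(np.ones((T, T), dtype=bool))
    # "verbatim" head: only the current and previous token (window of 2)
    local = causal & (np.arange(T)[:, None] - np.arange(T)[None, :] < 2)

    verbatim = attend(q, k, v, local)    # fine-grained, recent detail
    gist = attend(q, k, v, causal)       # coarse mixture over the whole context
    out = np.concatenate([verbatim, gist], axis=-1)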

iamronaldo 2 hours ago

Not transformer based?

  • lhl 2 hours ago

    Since it sounds from the announcement like the model hasn't changed much, here's the Zamba 1 paper for reference: https://arxiv.org/pdf/2405.16712

    Zamba 1 has a single shared attention block that is applied every 6 Mamba blocks. For Zamba 2: "Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network."

    Perhaps also of interest: Nvidia released a paper back in June testing hybrid SSM models, and their small-scale (<1B) experiments found that ~8% attention layers (a 12:1 SSM:attention ratio) was optimal. https://research.nvidia.com/publication/2024-06_empirical-st...

    The 8B param/3.5T token model they trained, Mamba2-Hybrid, was also Apache 2.0 licensed: https://huggingface.co/nvidia/mamba2-hybrid-8b-3t-128k
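
    A minimal sketch of that sharing/interleave structure as I read it (not the authors' code; block internals are elided, and I'm assuming Zamba 2 keeps the every-6 cadence from Zamba 1):

      class MambaBlock:
          """Stand-in for a Mamba2 layer; every instance has its own params."""

      class AttentionBlock:
          """Stand-in for the shared transformer block; instances get reused."""

      def build_zamba2_stack(n_mamba=48, period=6):
          shared = [AttentionBlock(), AttentionBlock()]   # the A and B blocks
          layers, applied = [], 0
          for i in range(n_mamba):
              layers.append(MambaBlock())                 # fresh params per layer
              if (i + 1) % period == 0:
                  layers.append(shared[applied % 2])      # A, B, A, B, ...
                  applied += 1
          return layers

      stack = build_zamba2_stack()
      attn = [l for l in stack if isinstance(l, AttentionBlock)]
      # attention shows up at 8 depths, but only 2 distinct blocks exist
      assert len(attn) == 8 and len({id(l) for l in attn}) == 2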

  • epistasis 2 hours ago

    Tri Dao and Albert Gu say "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality"

    https://arxiv.org/abs/2405.21060

    Mamba-2 is used in Zamba 2.
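
    The simplest (scalar, 1-semiseparable) case of the duality fits in a few lines: the SSM recurrence and a masked, attention-style matmul give the same output. A toy numpy check, illustrative only, with made-up sizes and values:

      import numpy as np

      rng = np.random.default_rng(1)
      T = 6
      a = rng.uniform(0.5, 1.0, T)    # decay/gate terms
      b, c, x = (rng.normal(size=T) for _ in range(3))

      # recurrent (SSM) form: h_t = a_t*h_{t-1} + b_t*x_t,  y_t = c_t*h_t
      h, y_rec = 0.0, np.zeros(T)
      for t in range(T):
          h = a[t] * h + b[t] * x[t]
          y_rec[t] = c[t] * h

      # dual form: y = M @ x with lower-triangular M[t,s] = c_t * (a_{s+1}...a_t) * b_s,
      # i.e. a masked "attention score" matrix
      M = np.zeros((T, T))
      for t in range(T):
          for s in range(t + 1):
              M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]

      assert np.allclose(y_rec, M @ x)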

  • oatsandsugar 2 hours ago

    On the page it states:

    "Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation."

    So it sounds like it is transformer-based?
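
    Sort of: a transformer block is in there, but its weights are paid for only once per shared block, which is how the parameter budget shifts to the Mamba2 backbone. A back-of-envelope sketch with made-up sizes:

      d_model = 2048                           # made-up width
      attn_block = 4 * d_model * d_model       # Q, K, V, O projections only
      slots = 8                                # depths where attention is applied

      unshared = slots * attn_block            # distinct weights at every slot
      shared = 2 * attn_block                  # ABAB: only blocks A and B exist
      print(f"unshared {unshared:,} vs shared {shared:,} attention params")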

wg0 an hour ago

If a model was trained in 1837, would it be useful even today? And how would models be trained in 2037, when most of the web might be autogenerated on the fly, like in the cgi-bin era?

  • Etheryte an hour ago

    State of the art models aren't trained the same way the first models were. High-quality datasets are both much more valuable and more useful than simply feeding in everything you could possibly crawl. Throwing in the kitchen sink and then some is a great way to burn money while also hurting your model's accuracy.

    • zeroq 15 minutes ago

      I don't follow the hype too closely, but I guess the early models were trained on data that was classified en masse by underpaid third-world workers. Today you could use yesterday's model to classify the data for you and build from that. Heck, you can even create synthetic data with current tech.