Grokking at the edge of linear separability

89 points by marojejian a year ago

I think this means that when training a cat detector it's better to have more bobcats and lynx and fewer dogs.

I feel super confused about this paper.

Apparently their training goal is for the model to ignore all input values and output a constant. Sure.

But then they outline some kind of equation of when grokking will or won't happen, and... I don't get it?

For a goal that simple, won't any neural network with any amount of weight decay eventually converge to a stack of all-zeros matrices (plus a single bias)?

What is this paper even saying, on an empirical level?

whatshisface a year ago

The "neural network" they are using is linear: matrix * data + bias. It's expressing a decision plane. There are two senses in which it can learn the constant classification: by pushing the bias very far away and by contorting the matrix around to rotate all the training data to the same side of the decision plane. Pushing the bias outwards generalizes well to data outside the training set, but contorting the matrix (rotating the decision plane) doesn't.
They discover that the training process tends to "overfit" using the matrix when the data is too sparse to cover the origin in its convex hull, but tends to push the bias outwards when the training data surrounds the origin. It turns out that the probability of the convex hull problem occurring goes from 0 to 1 in a brief transition when the ratio of the number of data points to the number of dimensions crosses 1/2.
They then attempt to draw an analogy between that, and the tendency of sparsely trained NNs to overfit until they have a magic amount of data, at which point they spontaneously seem to "get" whatever it is they're being trained on, gaining the ability to generalize.
Their examples are likely the simplest models to exhibit a transition from overfitting to generalization when the amount of training data crosses a threshold, but it remains to be seen if they exhibit it for similar reasons to the big networks, and if so what the general theory would be. The paper is remarkable for using analytic tools to predict the result of training, normally only obtained through numerical experiments.
- 082349872349872 a year ago
  
  As the origin is special, instead of training in a linear space, what would training in an affine space do?
  
  whatshisface a year ago
  
  I think they are training in an affine space, but I see what you're saying. The initialization of the bias must be breaking the symmetry in a way that makes the origin special. Of course to some degree that's unavoidable since we have to initialize on distributions with compact support.
- PoignardAzur a year ago
  
  > It's expressing a decision plane. There are two senses in which it can learn the constant classification: by pushing the bias very far away and by contorting the matrix around to rotate all the training data to the same side of the decision plane.
  You're saying that like you expect it to be intuitively understandable to the average HN audience. It really isn't.
  What do you mean by "rotate all the training data to the same side of the decision plane"? As in, the S matrix rotates all inputs vector to output vectors that are "on the right side of the plane"? That... doesn't make sense to me; as you point out, the network is linear, there's no ReLU, so the network isn't trying to get data on "the same side of the plane", it's trying to get data "on the plane". (And it's not rotating anything, it's a scalar product, not a matmul. S is one-dimensional.)
  (Also I think their target label is zero anyway, give how they're presenting their loss function?)
  But in any case, linear transformation or not, I'd still expect the weight and bias matrices to converge to zero given any amount of weight decay whatsoever. That's the core insight of the original grokking papers: even once you've overfit, you can still generalize if you do weight decay.
  It's weird that the article doesn't mention decay at all.
  
  whatshisface a year ago
  
  The network is linear, but the loss is ln(1+exp(x)), a soft ReLU.

alizaid a year ago

Grokking is fascinating! It seems tied to how neural networks hit critical points in generalization. Could this concept also enhance efficiency in models dealing with non-linearly separable data?

wslh a year ago

Could you expand about grokking [1]? I superficially understand what it means but it seems more important that the article conveys.
Particularly:
> Grokking can be understood as a phase transition during the training process. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research.
Does that paper add more insights?
[1] https://en.wikipedia.org/wiki/Grokking_(machine_learning)?wp...
- tanananinena a year ago
  
  This is probably the most interesting (and insightful) paper on grokking I’ve read recently: https://arxiv.org/abs/2402.15555

diwank a year ago

Grokking is so cool. What does it even mean that grokking exhibits similarities to criticality? As in, what are the philosophical ramifications of this?

hackinthebochs a year ago

Criticality is the boundary between order and chaos, which also happens to be the boundary at which information dynamics and computation can occur. Think of it like this: a highly ordered structure cannot carry much information because there are few degrees of freedom. The other extreme is too many degrees of freedom in a chaotic environment; any correlated state quickly gets destroyed by entropy. The point at which the two dynamics are balanced is where computation can occur. This point has enough dynamics that state can change in a controlled manner, and enough order so that state can reliably persist over time.
I would speculate that the connection between grokking and criticality is that grokking represents the point at which a network maximizes the utility of information in service to prediction. This maximum would be when dynamics and rigidity are finely tuned to the constraints of the problem the network is solving, when computation is being leveraged to maximum effect. Presumably this maximum leverage of computation is the point of ideal generalization.
- soulofmischief a year ago
  
  A scale-free network is one whose degree distribution follows a power law. [0]
  Self-organized criticality describes a phenomenon where certain complex systems naturally evolve toward a critical state where they exhibit power-law behavior and scale invariance. [1]
  The power laws observed in such systems suggest they are at the edge between order and chaos. In intelligent systems, such as the brain, this edge-of-chaos behavior is thought to enable maximal adaptability, information processing, and optimization.
  The brain has been proposed to operate near critical points, with neural avalanches following power laws. This allows a very small amount of energy to have an outsized impact, the key feature of scale-free networks. This phenomenon is a natural extension of the stationary action principle.
  [0] https://en.wikipedia.org/wiki/Scale-free_network
  [1] https://www.researchgate.net/publication/235741761_Self-Orga...
  
  zburatorul a year ago
  
  I can see how scale free systems have their action stay invariant under more transformations. I'd like to better understand the connection with action stationarity/extreme. Can you say more?
  
  soulofmischief a year ago
  
  Simply put, hubs in a scale-free network act as efficient intermediaries, minimizing the overall cost, in terms of action, for communication or interaction between nodes.
  Scale-free networks are robust to random dropout (though not to targeted dropout of hubs) and this serves to stabilize the system. The interplay between stability and stationary action is the key here.
  What follows is my own mathematical inquiry into a generalized stationary action principle, which might provide some intuition. Feel free to correct any mistakes.
  We often define action as the integral of a Lagrangian [0] over time:
  S = ∫ₜ₁ᵗ² Ldt
  Typically the Lagrangian is defined at a specific layer in a hierarchical system. Yet, Douglas Hofstadter famously introduces the concept of "strange loops" in Gödel, Escher, Bach. A strange loop is a cyclic structure which arises within several layers of a hierarchical system, due to inter-layer feedback mechanisms. [1] Layers might be described with respect to their information dynamics. In a brain network each layer might be at the quantum, chemical, mechanical, biological, psychological, etc. scale.
  Thus, we could instead consider total action within a hierarchical system, with each layer xᵢ having a Lagrangian ℒᵢ defined which best captures the dynamics of that layer. We could define total action as a sum of the time integrals of each Lagrangian plus the time integral of a coupling function C(x₁,x₂,...,xₙ). This coupling function captures the dynamics between coupled layers and allows for inter-layer feedback to affect global state.
  So we end up with
  S = ∫ₜ₁ᵗ²(∑ ℒᵢ(xᵢ,ẋᵢ) + C(x₁,x₂,...,xₙ))dt
  Now, when S ≈ 0 it means that each layer in the system has minimized not necessarily its own local action, but the global action of the system with respect to each layer. It is often the case however that scale-free networks exhibit fractal-like behavior and thus tend to be both locally and globally efficient, and structurally invariant under scaling transformations. In a scale-free network, each subnetwork is often itself scale-free.
  We might infer that global stability is the result of stationary action (and thus energy/entropy) management across all scales. Strange loops are effectively paths of least action through the hierarchical system.
  Personally I think that minimization of action at certain well-defined layers might be able to predict the scales at which proceeding layers emerge, but that is beside the point.
  By concentrating connections in a small number of hubs, the system minimizes the overall energy expenditure and communication cost. Scale-free networks can emerge as the most action-efficient structures for maintaining stable interactions between a large number of entities in a hierarchical system.
  A network can be analyzed in this fashion both intrinsically (each node or subnetwork representing a hierarchical layer) or in the context of a larger network within which it is embedded (wherein the network is a single layer). When a network interacts with other networks to form a larger network, it's possible that other non-scale-free architectures more efficiently reduce global action.
  I imagine this is because the Lagrangian for each layer in the hierarchy becomes increasingly complex and at some critical point, goal-oriented (defined here as tending toward a non-stationary local action in order to minimize global action or the action of another layer). Seemingly anomalous behavior which doesn't locally follow the path of least action might be revealed to be part of a larger hierarchical loop which does follow the path of least action, and this accounts for variation in structure within sufficiently complex networks which exhibit overall fractal-like structure.
  Let me know if any of that was confusing or unclear.
  [0] https://en.wikipedia.org/wiki/Lagrangian_mechanics
  [1] https://en.wikipedia.org/wiki/Strange_loop
- Agingcoder a year ago
  
  This looks very interesting. Would you have references ? ( not necessarily on grokking but about the part where computation can occur only when the right balance is found )
  
  hackinthebochs a year ago
  
  Hard to pin down a single citation of that point. But some good places to start are:
  https://en.wikipedia.org/wiki/Critical_brain_hypothesis
  https://journals.aps.org/pre/abstract/10.1103/PhysRevE.79.04... (on sci-hub)

kouru225 a year ago

And winner of Best Title of the Year goes to:

bbor a year ago

I'm glad I'm not the only one initially drawn in by the title! As the old meme goes;
> If you can't describe your job in 3 Words, you have a BS job:
> 1. "I catch fish" Real job!
> 2. "I drive taxis" Real job!
> 3. "I grok at the edge of linear separability" BS Job!
- sva_ a year ago
  
  > ai researcher
  
  o11c a year ago
  
  Amazingly, 2/5 LLMs I asked consistently (I only tested a few times) gave a reasonable answer (usually "two", but occasionally "three") and explanation for: How many words in "AI researcher?"
  "Four" is completely bogus no matter how you measure it, even if it's in a list of alternatives. Also the word "engineer" definitely isn't in there. "researcher" is present, and "er" isn't even a word!
  I'm of two minds on the one that explicitly argued (without prompting) that the question mark counts as a word. But it failed other times anyway.
  
  cryptonector a year ago
  
  Five is right out.

bbor a year ago

Wow, fascinating stuff and "grokking" is news to me. Thanks for sharing! In typical HN fashion, I'd like to come in as an amateur and nitpick the terminology/philosophy choices of this nascent-yet-burgeoning subfield:

  We begin by examining the optimal generalizing solution, that indicates the network has properly learned the task... the network should put all points in Rd on the same side of the separating hyperplane, or in other words, push the decision boundary to infinity... Overfitting occurs when the hyperplane is only far enough from the data to correctly classify all the training samples.

This is such a dumb idea on first glance, I'm so impressed that they pushed past that and used it for serious insights. It truly is a kind of atomic/fundamental/formalized/simplified way to explore overfitting on its own.

Ultimately their thesis, as I understand it from the top of page 5, is roughly these two steps (with some slight rewording):

  [I.] We call a training set separable if there exists a vector [that divides the data, like a 2D vector from the origin dividing two sets of 2D points]... The training set is almost surely separable [when there's twice as many dimensions as there are points, and almost surely inseparable otherwise]...

Again, dumb observation that's obvious in hindsight, which makes it all the more impressive that they found a use for it. This is how paradigm shifts happen! An alternate title for the paper could've been "A Vector Is All You Need (to understand grokking)". Ok but assuming I understood the setup right, here's the actual finding:

  [II.] [Given infinite training time,] the model will always overfit for separable training sets[, and] for inseparable training sets the model will always generalize perfectly. However, when the training set is on the verge of separability... dynamics may take arbitrarily long times to reach the generalizing solution [rather than overfitting]. 
  **This is the underlying mechanism of grokking in this setting**.

Or, in other words from Appendix B:

  grokking occurs near critical points in which solutions exchange stability and dynamics are generically slow

Assuming I understood that all correctly, this finally brings me to my philosophical critique of "grokking", which ends up being a complement to this paper: grokking is just a modal transition in algorithmic structure, which is exactly why it's seemingly related to topics as diverse as physical phase changes and the sudden appearance of large language models. I don't blame the statisticians for not recognizing it, but IMO they're capturing something far more fundamental than a behavioral quirk in some mathematical tool.

Non-human animals (and maybe some really smart plants) obviously are capable of "learning" in some human-like way, but it rarely surpasses the basics of Pavlovian conditioning: they delineate quantitative objects in their perceptive field (as do unconscious particles when they mechanically interact with each other), computationally attach qualitative symbols to them based on experience (as do plants), and then calculate relations/groups of that data based on some evolutionarily-tuned algorithms (again, a capability I believe to be unique to animals and weird plants). Humans, on the other hand, not only perform calculations about our immediate environment, but also freely engage in meta-calculations -- this is why our smartest primate relatives are still incapable of posing questions, yet humans pose them naturally from an extremely young age.

Details aside, my point is that different orders of cognition are different not just in some quantitative way, like an increase in linear efficiency, but rather in a qualitative--or, to use the hot lingo, emergent--way. In my non-credentialed opinion, this paper is a beautiful formalization of that phenomenon, even though it necessarily is stuck at the bottom of the stack so-to-speak, describing the switch in cognitive capacity from direct quantification to symbolic qualification.

It's very possible I'm clouded by the need to confirm my priors, but if not, I hope to see this paper see wide use among ML researchers as a clean, simplified exposition of what we're all really trying to do here on a fundamental level. A generalization of generalization, if you will!

Alon, Noam, and Yohai, if you're in here, congrats for devising such a dumb paper that is all the more useful & insightful because of it. I'd love to hear your hot takes on the connections between grokking, cognition, and physics too, if you have any that didn't make the cut!

anigbrowl a year ago

It's just another garbage buzzword. We already have perfectly good words for this like understanding and comprehension. The use of grokking is a a form of in-group signaling to get buy-in from other Cool Kids Who Like Robert Heinlein, but it's so obviously a nerdspeak effort at branding that it's probably never going to catch on outside of that demographic, no matter how fetch it is.
- kaibee a year ago
  
  > It's just another garbage buzzword. We already have perfectly good words for this like understanding and comprehension.
  Yeah, try telling people that NNs contain actual understanding and comprehension. That won't be controversial at all.
  
  anigbrowl a year ago
  
  I'm fully aware that most people disagree with that idea, although I myself think we're not very removed from LLMs at all, and there's no fundamental barrier to machine consciousness.
  While that may be an unpopular opinion at present, and more so outside of the technical/academic worlds, trying to market the same idea by giving it a vaguely cool new name is asinine in my view. I don't see how its any different from some entrepreneurially minded physicist trying to get attention by writing papers about magnetism but calling it 'The Force' instead to build a following of Star Wars fans.
  It's not that I dislike Heinlein or anything, I'm rather a fan actually. But trying to juice up research with cool sci-fi references is cringe, and when I see it I reflexively discount the research claim because of the unpleasant feeling that it's a sales pitch in disguise.