Cool work! Correct me if I'm wrong, but I believe that to use OpenAI's new, more reliable Structured Outputs, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema. I get this error when I do `response_format={"type": "json_schema"}`:
I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so I'm not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low-effort) way to have the API infer the structure on its own.

[0] https://platform.openai.com/docs/guides/structured-outputs/s...
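For reference, here's a minimal sketch of the two response_format shapes, following the Structured Outputs guide. The schema name `extraction` and the `title`/`date` fields are placeholders I made up for illustration:

```python
# Old JSON mode: the model returns *some* valid JSON; structure is inferred.
json_mode = {"type": "json_object"}

# Structured Outputs: a schema is required, and "strict" enforces conformance.
structured = {
    "type": "json_schema",
    "json_schema": {
        "name": "extraction",  # placeholder name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["title", "date"],
            "additionalProperties": False,
        },
    },
}
```

Passing only `{"type": "json_schema"}` without the nested `json_schema` object is what triggers the error above.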
I’ve been using jsonschema since forever with function calling. Does structured output just formalize things?
Function calling provides a "hint" in the form of a JSON schema for an LLM to follow; the models are trained to follow provided schemas. If you have really complicated or deeply nested models, they become less reliable at generating schema-conformant JSON.
Structured outputs apply a context-free grammar to the prediction generation so that, for each token generation, only tokens that generate a perfectly conformant JSON schema are considered.
The benefit of doing this is predictability, but there's a trade-off in prediction stability; apparently structured output can constrain the model to generate in a way that takes it off the "happy path" of how it assumes text should be generated.
Happy to link you to some papers I've skimmed on it if you're interested!
Structured output uses "constrained decoding" under the hood. They convert the JSON schema to a context free grammar so that when the model samples tokens, invalid tokens are masked to have a probability of zero. It's much less likely to go off the rails.
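A toy sketch of that masking step, if it helps. Real implementations work over the tokenizer's full vocabulary with a compiled grammar; here the "grammar" is just a hardcoded set of allowed token ids:

```python
import math

def constrained_sample_probs(logits, valid_token_ids):
    """Toy constrained decoding: mask tokens the grammar disallows
    to probability zero, then renormalize over the survivors."""
    masked = [x if i in valid_token_ids else float("-inf")
              for i, x in enumerate(logits)]
    m = max(masked)  # finite as long as at least one token is valid
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Suppose the grammar says only tokens 0 and 2 may appear next:
probs = constrained_sample_probs([1.0, 3.0, 2.0, 0.5], {0, 2})
# probs[1] and probs[3] are exactly 0; the remaining mass sums to 1.
```

Note token 1 had the highest raw logit but gets zero probability, which is exactly the "off the happy path" effect mentioned upthread.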
Made a small project to help extract structure from documents (PDF, JPG, etc. -> JSON or CSV): https://datasqueeze.ai/
There are 10 free pages to extract if anyone wants to give it a try. I've found that just sending a PDF to models doesn't extract it properly, especially with longer documents. I've tried to incorporate all best practices into this tool. It's a pet project for now. Lmk if you find it helpful!
Is this simply the OCR bits to feed to OpenAI's structured output?
Similarly, I've found old-school OCR is needed for more reliability.
I've been using this to OCR some photos I took of books and it's remarkable at it. My first pass was just a loop where I'd OCR, feed the text to the model, and ask it to normalize into a schema, but I found that just sending the image to the model and asking it to OCR and turn it into the shape of data I wanted was much more accurate.
Combining Google's OCR with an LLM gives OCR superpowers. Tell the LLM the text came from OCR and ask it to correct it.
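Something like this, roughly. The exact instruction wording is just what's worked for me, not anything official:

```python
def build_ocr_correction_messages(ocr_text: str) -> list:
    """Wrap raw OCR output in a chat prompt asking the model to fix
    recognition errors without inventing or dropping content."""
    return [
        {"role": "system",
         "content": ("The following text came from OCR and may contain "
                     "recognition errors (swapped characters, broken words, "
                     "merged lines). Correct obvious OCR mistakes only; do "
                     "not add, remove, or reorder content.")},
        {"role": "user", "content": ocr_text},
    ]

messages = build_ocr_correction_messages("Tne qulck brown f0x")
```

The "correct mistakes only" constraint matters; without it the model happily rewrites the text.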
That sounds like it could be very dangerous when the LLM gets it wrong...
Tried it, works great, ty!
I’m making a free open source library for this, check it at http://github.com/fetchfox/fetchfox
MIT license. It's just one line of code to get started: `fox.run("get data from example.com")`
We used GPT-4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills.
Worked better than any OCR we tried.
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
OpenAI's API only accepts images: https://platform.openai.com/docs/guides/vision
To my knowledge, all the LLM services that take in PDF input do their own text extraction of the PDF before feeding it to an LLM.
Or convert the PDF to an image and send that. We've done it for things that Textract completely mangled but Sonnet has no problem with, especially tables built out of text characters from very old systems.
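Here's roughly the payload shape for sending the rendered page inline, following the vision guide's base64 data-URL format. The actual PDF-to-PNG rendering step (pdftoppm, pdf2image, etc.) is left out; the prompt text is just an example:

```python
import base64

def image_message(png_bytes: bytes, prompt: str) -> dict:
    """Build a Chat Completions user message carrying an inline
    base64-encoded PNG alongside a text instruction."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

# png_bytes would come from rendering one PDF page; dummy bytes shown here.
msg = image_message(b"\x89PNG...", "Extract the table as JSON.")
```

One message per page keeps long documents from blowing the context window.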
I don’t think it does OCR. It’s able to use the structure of the PDF to guide the parsing.
Stuff like this shows how much better the commercial models are than local models. I've been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and ChatGPT do exactly what I want without fail.
OpenAI stopped releasing information about their models after gpt-3, which was 175b, but the leaks and rumours that gpt-4 is an 8x220 billion parameter model are most certainly correct. 4o is likely a distilled 220b model. Other commercial offerings are going to be in the same ballpark. Comparing these to llama 3 8b is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240b range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding edge state of the art.
The Berkeley Function-Calling Leaderboard tracks function calling/structured data performance from multiple models: https://gorilla.cs.berkeley.edu/leaderboard.html
Llama isn't on there but a few finetunes of it (Hermes) are OSS.
In my tests, Llama 3.1 8b was way worse than Llama 2 13b or Solar 13b.
I mean, those aren't comparable models. I wonder how the 405b version compares.
You raise a valid point, but 4o is way smaller than 405B. And 4o mini that's described in the article is highly likely <30B (if we're talking dense models).
Is the size of OpenAI's models public, or is this guesswork?
If your company has a lot of ex-OpenAI employees, then you know ;)
And the public numbers are mostly right; the latest values are likely smaller now, as they have been working on downsizing everything.
<< Stuff like this shows how much better the commercial models are than local models.
I did not reach the same conclusion, so I would be curious if you could provide the rationale/basis for your assessment in the link. I am playing with the humble Llama 3 8B here, and results for Federal Register-type stuff (without going into details) were good for something I was expecting to be... not great.
edit: Since you mentioned Llama explicitly, could you talk a little about the data/source you are using for your results? You got me curious and I want to dig a little deeper.
What a sad state for humanity that we have to resort to this sort of OCR/scraping instead of the original data being released in a machine-readable format in the first place.