AlbertoGP 4 days ago

> [...] self-hostable solution that leverages state-of-the-art (SOTA) vision models for segment extraction and OCR, unifying the output through a Rust Actix server. This setup allows you to process PDFs and extract segments at an impressive speed of approximately 5 pages per second on a single NVIDIA L4 instance, offering a cost-effective and scalable solution for high-accuracy bounding box segment extraction and OCR. This solution has models that accommodate for both GPU and CPU environments.

oliwarner 4 days ago

> To use Chunkr privately without complying to the AGPL-3.0 license terms you can contact us

AGPL has no bearing on how I use software, only how I can redistribute it. AGPL does not stop a person or company using Chunkr or its product in a commercial environment without further license.

  • Onavo 4 days ago

    Yes, but tools that handle PDFs are usually accessed over the network. It would be a minefield if legal says the AGPL affects every single microservice that interacts with your service. This is why there is a blanket AGPL ban at most tech companies. The AGPL is effectively an EULA.

    • oliwarner 4 days ago

      First, the quote talks about what I do privately. AGPL explicitly encourages me to do whatever the hell I like with it. I don't need another license.

      But more broadly, the interpretation that AGPL microservices are viral is just one interpretation. If it really is just a swappable backend interface, why should it affect other subsystems? IANAL, but it seems pretty trivial to insulate a microservice with the same sort of GPL condom companies ship to avoid "linking" to (e.g.) the kernel.

      https://medium.com/swlh/understanding-the-agpl-the-most-misu...

      • raffraffraff 3 days ago

        I laughed at "GPL condom". Well done.

kybernetikos 4 days ago

It'd be great to see some examples on the web site.

  • creer 3 days ago

    Probably deserves a paper too, really.

infecto 4 days ago

As with all of the startups in this space, there is never a comparison of output between them and the ($$) competition. I realize they are doing some segmentation in the workflow, but imo the valuable part is the actual document text and table extraction piece. Textract in its cheapest and simplest form is cheaper than this service. With tables turned on, Textract is more expensive, but I would be curious whether Textract does a better job.

  • mistrial9 4 days ago

    What you want is work in itself. Who pays the reviewer? How do you discover the reviewer? Secondly, why must there be one "winner"? Maybe there are niches, local markets, business groups. They want something and someone provides it.

    • infecto 4 days ago

      Huh? This is a company selling a product/service. I am saying they have done nothing to compare themselves to the competition beyond saying the competition is expensive, and I am arguing that the competition is not much more expensive and might offer superior quality.

ollivera 4 days ago

Initially, I didn’t like having the tables as images, but using GPT Vision might be a more accurate way to obtain the markdown. I was also considering using the Adobe Extraction API to extract markdown from the CSV file. So, I will try your API over the weekend and see the results.

saaaaaam 4 days ago

Although the docs say “get started by creating an account on chunkr.ai” there doesn’t seem to be any way to create an account.