Large language models (LLMs), such as OpenAI’s GPT-4, are the building blocks of a growing number of AI applications. But some businesses are hesitant to adopt them because the models cannot access first-party and proprietary data.
This is certainly not an easy problem to solve, given that this type of data often sits behind firewalls and in formats that LLMs cannot take advantage of. But a relatively new startup, Unstructured.io, is attempting to remove those barriers with a platform that extracts and stages corporate data in a way that LLMs can understand and use.
Brian Raymond, Matt Robinson, and Crag Wolfe co-founded Unstructured in 2022 after working at Primer AI, a company focused on building and deploying natural language processing (NLP) solutions for enterprise clients.
“While at Primer, we repeatedly encountered bottlenecks in ingesting and preprocessing raw customer files containing NLP data (e.g. PDF, email, PPTX, XML, etc.) and converting them into clean, well-curated files ready for consumption by machine learning models or pipelines,” Unstructured CEO Raymond told TechCrunch in an email interview. “There was no data integration or intelligent document processing company that could help with this, so we decided to start one to tackle it head-on.”
In fact, data processing and preparation is often a time-consuming step in any AI development workflow. According to one poll, data scientists spend nearly 80% of their time preparing and managing data for analysis. As a result, most of the data that companies produce, about two-thirds, goes unused, according to another poll.
“Organizations generate large amounts of unstructured data every day, and this data combined with LLMs can increase productivity. The problem is that this data is fragmented,” continued Raymond. “The dirty secret of the NLP community is that today’s data scientists still have to build handcrafted, one-off data connectors and preprocessing pipelines entirely manually. Unstructured (provides) a comprehensive solution for natural language data.”
Unstructured provides a number of tools to help clean and transform enterprise data for ingestion by LLMs, including tools to remove advertisements and other unwanted objects from web pages, concatenate text, perform optical character recognition on scanned pages, and more. The company develops processing pipelines for specific types of PDFs; HTML and Word documents, including those used for SEC filings; and, notably, US Army Officer Evaluation Reports.
To process documents, Unstructured trained its own “file conversion” NLP model from scratch and assembled a series of other models to extract text and about 20 discrete elements (such as titles, headers, and footers) from raw files. Various connectors (about 15 in total) pull documents from existing data sources such as customer relationship management software.
“Behind the scenes, we’re using all kinds of different techniques to abstract away the complexity,” Raymond said. “For example, for old PDFs and images, we use computer vision models. For other file types, we cleverly combine NLP models, Python scripts, and regular expressions.”
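To make the regex-and-scripts approach Raymond describes more concrete, here is a minimal sketch in plain Python. It is not Unstructured's code or API: the `Element` class, the function names, the three labels, and the eight-word title heuristic are all assumptions made up for this illustration. It splits raw text into blocks, normalizes whitespace, and assigns each block a rough element type.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical container for one extracted piece of a document."""
    kind: str  # "Title", "ListItem", or "NarrativeText" (labels chosen for this sketch)
    text: str


# Leading bullet characters or numbering such as "1." or "2)".
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")
WHITESPACE = re.compile(r"\s+")


def clean(text: str) -> str:
    """Collapse runs of whitespace (including line breaks) and trim the ends."""
    return WHITESPACE.sub(" ", text).strip()


def classify(block: str) -> Element:
    """Label one block of text using simple, assumed heuristics."""
    if BULLET.match(block):
        # Strip the bullet or number, then clean what is left.
        return Element("ListItem", clean(BULLET.sub("", block, count=1)))
    text = clean(block)
    # Assumption: short blocks without closing punctuation are titles.
    if len(text.split()) <= 8 and not text.endswith((".", "!", "?", ":")):
        return Element("Title", text)
    return Element("NarrativeText", text)


def partition(raw: str) -> list[Element]:
    """Split raw text on blank lines and classify each non-empty block."""
    blocks = re.split(r"\n\s*\n", raw)
    return [classify(block) for block in blocks if block.strip()]


if __name__ == "__main__":
    sample = (
        "Quarterly Report\n\n"
        "Revenue   grew in\nthe third quarter.\n\n"
        "- Hire two engineers\n\n"
        "2) Ship the API"
    )
    for element in partition(sample):
        print(f"{element.kind}: {element.text}")
```

Rules like these break quickly on scanned pages and complex layouts, which is presumably where the computer vision and NLP models Raymond mentions take over.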
Downstream, Unstructured integrates with providers such as LangChain, a framework for creating LLM applications, and vector databases such as Weaviate and MongoDB’s Atlas Vector Search.
Previously, Unstructured’s only product was an open-source suite of these data-processing tools. Raymond claims the software has been downloaded about 700,000 times and used by more than 100 companies. But to cover development costs, and no doubt appease investors, the company launched a commercial API that can convert data in 25 different file formats, including PowerPoint and JPG.
“We’ve been working with government agencies and we’ve had millions of dollars in revenue in a very short period of time . . . Because our focus is on artificial intelligence, we are in a market segment that is not affected by the broader economic slowdown,” Raymond said.
Perhaps a product of Raymond’s background, Unstructured has unusually strong ties to the defense establishment. Before joining Primer, he was an active member of the US intelligence community, serving in the Middle East, then the White House during the Obama administration, and then the CIA.
Unstructured has received small business contracts from the U.S. Air Force and U.S. Space Force, and is working with U.S. Special Operations Command (SOCOM) to deploy an LLM “in conjunction with mission-relevant data.” Additionally, Unstructured’s board includes Michael Groen, a former general and director of the Pentagon’s Joint Artificial Intelligence Center, and Ryan Lewis, who previously led the Department of Defense’s Defense Innovation Unit.
The defense angle, a solid early revenue stream, may have been the deciding factor in Unstructured’s recent funding round. Today, the company announced that it has raised $25 million across a Series A and a previously undisclosed seed round. Madrona led the Series A round, which included seed-round backer Bain Capital Ventures, M12 Ventures, Mango Capital, MongoDB Ventures and Shield Capital, as well as several angel investors.