CATEGORY · 22 COMPANIES
Data Labeling / Training Data
Human-feedback, annotation, and training-data operations.
| # | Company | Description | Valuation | ARR | Headcount | Fresh |
|---|---------|-------------|-----------|-----|-----------|-------|
| 01 | Unstructured | Unstructured provides a GenAI data pipeline platform that extracts, transforms, and loads unstructured data (documents, PDFs, images, emails) into AI-ready format. The platform processes 65+ file types and integrates with enterprise data warehouses, vector databases, and LLM workflows with built-in compliance and governance features. | - | - | - | green |
| 02 | Scale AI | Leading data-labeling and training-data provider, now 49% owned by Meta. | - | ~$2B projected revenue 2025 (+130% YoY) | Not separately reported | - |
| 03 | Labelbox | Annotation and RLHF platform | - | - | - | - |
| 04 | Extend | Extend provides production-grade document processing infrastructure with specialized vision models for parsing, extracting, and splitting complex documents. The platform enables technical teams to build accurate document pipelines in days rather than months, serving enterprises including Chime, Brex, Flatiron Health, and Opendoor. | - | - | 25 (as of Winter 2023) | green |
| 05 | Giskard | Giskard is an AI red-teaming and LLM security platform automating vulnerability detection and testing. The platform conducts black-box testing of conversational AI agents to identify hallucinations, biases, prompt injection risks, and security flaws using domain-specific test case generation and proactive monitoring. | - | - | - | green |
| 06 | Airtrain AI | Airtrain AI is a no-code platform for fine-tuning and evaluating large language models. The platform enables AI developers to compare 20+ open-source and proprietary LLMs, batch-evaluate models against test datasets (up to 10,000 examples), and fine-tune models with custom data to reduce costs by up to 90% versus proprietary APIs. | - | ~$1.7M (2024) | ~11 | green |
| 07 | Openlayer | Enterprise platform for AI evaluation and governance supporting traditional ML and generative AI systems. Provides tools for testing, monitoring, data quality checks, compliance, and model evaluation from development through production. | - | - | ~23 | green |
| 08 | LAION | LAION (Large-scale Artificial Intelligence Open Network) was founded in 2021 with the goal of increased access to quality, AI-friendly datasets and curates and releases massive openly-licensed datasets for AI/ML usage. It is a German non-profit organization. | - | - | - | green |
| 09 | Unstract | Unstract is an open-source, no-code platform for agentic AI document processing that automates extraction, classification, and analysis of documents at scale. The platform uses LLMs with built-in safeguards (LLMChallenge consensus mechanism) to eliminate hallucinations and ensure production-grade accuracy for compliance-heavy workflows. | - | - | - | green |
| 10 | Foretellix | Foretellix specializes in scenario-based verification and validation (V&V) for autonomous vehicles and ADAS, providing safety-focused simulation and testing frameworks for automotive and industrial automation. | estimated $300M+ | - | ~100 | green |
| 11 | David AI | David AI is an audio data research company building the data layer for voice AI. The startup creates high-quality, full-duplex channel-separated speech training datasets that address the critical bottleneck in voice AI model development, supporting cutting-edge voice production systems and frontier research. | - | - | - | green |
| 12 | Invisible Technologies | AI training operations | - | - | - | - |
| 13 | Surge AI | Bootstrapped data-labeling leader | - | - | - | - |
| 14 | Toloka | Crowdsourced data labeling | - | - | - | - |
| 15 | Clarifai | Clarifai is an end-to-end AI platform specializing in computer vision, natural language processing, and audio recognition. It provides tools for data labeling, model training, evaluation, and inference, supporting the full AI lifecycle with pre-trained models and custom model development for image, video, text, and audio data. | - | - | - | green |
| 16 | Snorkel AI | Programmatic data labeling | - | - | - | - |
| 17 | Argilla | Argilla is an open-source data labeling platform for Natural Language Processing that empowers data and machine learning teams to build and monitor high-quality training data. The platform provides human-in-the-loop and programmatic labeling features, and was acquired by Hugging Face in June 2024. | - | - | - | green |
| 18 | Kili Technology | Collaborative data annotation platform combining image, video, text, audio, and OCR labeling with AI-assisted automation, data curation, and DataOps workflows. Provides both self-service tooling and managed expert labeling services for enterprise AI projects. | - | - | - | green |
| 19 | Understand.ai | AI data labeling and annotation platform specialized for autonomous driving and automotive computer vision. Provides crowdsourced and AI-assisted image annotation for training autonomous vehicle perception systems. | Undisclosed | - | 50-100 (estimated) | green |
| 20 | Dataloop | End-to-end AI development platform covering annotation (image, video, LiDAR), data QA/verification, workforce and project management, and generative AI model building. Combines data labeling tools with production ML pipeline management and automation for computer vision. | - | - | - | green |
| 21 | Flitto | Crowdsourcing translation and data labeling platform leveraging AI for multilingual content processing. Provides human-in-the-loop AI data annotation services across 100+ languages for training and fine-tuning large language models. | - | - | - | green |
| 22 | Micro1 | AI-driven data labeling platform that combines recruiting and managed labeling services. Uses AI agents to rapidly vet and onboard expert annotators for high-quality data work. Focuses on domain expert labeling rather than crowdsourcing. | $500M | ~$50M | - | green |