Building the science of training data.

Quality human data is running out. Over half of web text is now AI-generated. Datoric is an AI research lab studying how data composition, expert reasoning, and multi-modal structure drive the next wave of model capabilities.

Our Mission

The next breakthrough in AI will come from better data, not bigger models.

Frontier labs are spending over $1B per year each on human data. Quality human text, roughly 300 trillion tokens, will be fully consumed by 2027. Meanwhile, synthetic contamination is degrading the open web as a training source.

The science of what makes training data effective remains poorly understood. Why do expert reasoning traces produce outsized capability gains? How does multi-modal composition affect emergence? Why do models collapse when trained on their own outputs?

We founded Datoric to answer these questions and build the research foundations for the post-data-wall era of AI.

Research Focus Areas

Understanding the foundations of AI capability.

The Data Wall

Quality human text is approaching exhaustion at ~300T tokens, while over half of web content is now AI-generated. We study how data scarcity and synthetic contamination reshape the training landscape.

Expert Reasoning Traces

The highest-value training signal isn't answers. It's the reasoning behind them. We research how professional decision traces in medicine, law, and engineering transfer to model capability.

Multi-Modal Composition

How do paired modalities like voice, vision, text, and sensor data interact during training? We investigate cross-modal data composition and the capability emergence it produces.

Agentic Data Systems

No collection infrastructure exists for agent training data. We study tool-use traces, error recovery patterns, and multi-step task completions, where small seed datasets yield outsized capability gains.

Human-Synthetic Flywheels

Models trained purely on synthetic outputs collapse. We research the optimal interplay between human-sourced data and synthetic augmentation: the flywheel that drives frontier performance.

Multilingual Representation

Frontier models fail dramatically on underrepresented languages, scoring as low as 35% on native-speaker benchmarks. We study how linguistic diversity and cultural context shape model behavior at scale.

Our Approach

Empirical, rigorous, applied.

01

Map distributional gaps

We identify where models systematically fail, from 35% accuracy on underrepresented language benchmarks to missing decision traces in professional domains like debugging, diagnosis, and legal reasoning.

02

Design data interventions

We construct targeted datasets with domain practitioners: expert reasoning traces, multi-modal paired data, and agentic task demonstrations that encode the knowledge synthetic generation cannot replicate.

03

Measure and publish

Every intervention is evaluated against domain-specific benchmarks, not generic leaderboards. We publish our findings on data composition, human-synthetic flywheels, and capability emergence.

Teams

Our research is organized around four core teams.

Data Science studies the theory of training data quality: measuring signal density, mapping distributional gaps, and understanding why models trained on expert reasoning traces outperform those trained on volume alone.

Multi-Modal Systems investigates how models learn from heterogeneous data, studying cross-modal transfer, paired data composition, and the 50–100% capability premiums that multi-modal training produces over single-modality approaches.

Agentic Intelligence researches the data infrastructure for AI agents: tool-use traces, error recovery patterns, and GUI interaction data, where 312 human demonstrations can be augmented to 27,000 training instances with 141% capability improvement.

Evaluation & Benchmarks builds domain-specific evaluation frameworks that expose failures generic benchmarks miss, measuring model performance in professional contexts across medicine, law, finance, and multilingual settings.

Research Notes

Where frontier models break, and what we are doing about it.

Every paper measures a specific failure pattern in a specific domain. We publish the full methodology, the dataset, and the statistical significance tests alongside every claim.

Paper № 05 · Multilingual

Frontier voice AI on Indian languages

ElevenLabs Scribe v2 leads at WER 0.277. AssemblyAI Universal-3 Pro silently collapses 5 of 9 Indic languages to romanized Latin script, a failure invisible to WER alone. Only Scribe v2 and Sarvam Saaras v3 cover all 10 target languages.

April 16, 2026 · 12 MIN READ

Read paper
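The script-collapse failure described above is invisible to WER computed on romanized references, but it is easy to surface directly from the transcript text. A minimal sketch (the function names and the 0.8 threshold are illustrative assumptions, not the paper's methodology): flag any transcript of Indic-language audio whose alphabetic characters are overwhelmingly Latin.

```python
def latin_fraction(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Latin ranges
    (Basic Latin through Latin Extended-B, i.e. code points below U+0250)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for c in letters if ord(c) < 0x0250)
    return latin / len(letters)


def flags_script_collapse(transcript: str, threshold: float = 0.8) -> bool:
    """Heuristic: a transcript of Indic-language audio that comes back
    mostly in Latin script has likely been romanized by the model."""
    return latin_fraction(transcript) >= threshold
```

A Devanagari transcript scores a Latin fraction near zero and passes; a romanized one scores near one and is flagged, regardless of what WER says.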

Paper № 04 · Safety

Sycophancy, hallucination, and calibration in video models

Sycophancy is universal: authoritative framing reduces contradiction detection by 16–18 pp across all seven frontier video models. No model is immune. All are overconfident when wrong.

April 7, 2026 · 14 MIN READ

Read paper
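"Overconfident when wrong" is commonly quantified with expected calibration error (ECE): bin predictions by stated confidence, then take the bin-size-weighted gap between average confidence and actual accuracy. A minimal sketch of the standard metric, not the paper's evaluation harness:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean |accuracy - mean confidence| over confidence bins.
    `confidences` are floats in [0, 1]; `correct` are 0/1 outcomes."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))

    total = len(confidences)
    error = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        error += (len(bucket) / total) * abs(accuracy - mean_conf)
    return error
```

A model that claims 90% confidence but is right only half the time contributes a 0.4 gap in that bin; a perfectly calibrated model scores zero.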

Paper № 03 · Video

Temporal reasoning and workflow understanding in video models

Across seven frontier video models, accuracy falls 44 points from 30-second to 5-minute clips. Error detection is catastrophic frontier-wide. The best model catches barely one in four intentional errors.

March 24, 2026 · 13 MIN READ

Read paper

Paper № 02 · Fairness

The linguistic equity gap in multilingual speech AI

Three of five dedicated ASR providers cannot serve low-resource African languages. Only ElevenLabs Scribe and Gemini 2.5 Pro produce usable transcripts, statistically tied at the top.

March 10, 2026 · 12 MIN READ

Read paper

Paper № 01 · Evaluation

Frontier voice models on professional speech

Across 12 systems spanning five dedicated ASR providers, four audio-native MLLMs, and three Claude text controls, ElevenLabs Scribe beats every audio-native multimodal LLM on WER. GPT-4o Audio is the reliability outlier.

February 24, 2026 · 11 MIN READ

Read paper
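WER, the headline metric in the comparison above, is word-level edit distance (substitutions, insertions, deletions) divided by the reference length. A minimal sketch of the standard definition, for readers unfamiliar with the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()

    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

One substitution in a three-word reference gives a WER of 0.333; a perfect transcript scores 0.0. Production WER pipelines add text normalization (casing, punctuation, numerals) before scoring.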

All research

Get in Touch

Interested in our research?

Whether you're a lab exploring data composition, a domain expert with unique datasets, or simply curious about our work — we'd love to hear from you.

Get in touch