Datasets

As part of LeMaterial, we have released, and will continue to maintain, several datasets that unify and standardize data from existing databases:

  • LeMat-Bulk

  • LeMat-BulkUnique

  • LeMat-Traj

    • Released in August 2025
    • LeMat-Traj is a large-scale dataset aggregating over 120 million atomic configurations from ab initio relaxation trajectories, curated from multiple sources (Materials Project, Alexandria, OQMD) and simulation protocols. It supports training and benchmarking of machine-learning interatomic potentials (MLIPs) and trajectory-aware models (e.g. force regressors, uncertainty quantifiers).
    • https://huggingface.co/datasets/LeMaterial/LeMat-Traj
    • https://arxiv.org/pdf/2508.20875
  • LeMat-Synth

    • Released in September 2025
    • LeMat-Synth is a multi-modal dataset that links materials to their synthesis procedures and performance data. It was built by parsing scientific literature with VLM/LLM pipelines and aims to support research on synthesizability prediction and synthesis planning. Drawing on over 80,000 open-access papers, it is one of the first large-scale datasets of material synthesis recipes, covering 35 synthesis methods and 16 material classes.
    • https://huggingface.co/datasets/LeMaterial/LeMat-Synth
    • https://www.arxiv.org/pdf/2510.26824

More datasets are under development, covering surfaces, defects, and electron densities. We welcome collaborators interested in extending or improving any of the datasets listed above.
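
The Hugging Face links above can be turned into repository ids and inspected without downloading a full dataset. The following is a minimal sketch, not an official LeMaterial loader: the helper names `repo_id_from_url` and `peek` are hypothetical, and the config and split names are discovered at runtime because they are not stated here. It assumes the Hugging Face `datasets` package is installed and that network access is available when `peek` is called.

```python
from urllib.parse import urlparse

# Dataset pages taken from the list above.
DATASET_URLS = {
    "LeMat-Traj": "https://huggingface.co/datasets/LeMaterial/LeMat-Traj",
    "LeMat-Synth": "https://huggingface.co/datasets/LeMaterial/LeMat-Synth",
}


def repo_id_from_url(url: str) -> str:
    """Turn a Hub dataset page URL into its '<org>/<name>' repository id."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hub dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def peek(repo_id: str, n: int = 3) -> list:
    """Stream the first n records of the first config and split found.

    Hypothetical helper: picking the *first* config and split is an
    arbitrary choice for illustration, not a recommendation.
    """
    # Imported lazily so repo_id_from_url works without `datasets` installed.
    from datasets import (
        get_dataset_config_names,
        get_dataset_split_names,
        load_dataset,
    )

    config = get_dataset_config_names(repo_id)[0]
    split = get_dataset_split_names(repo_id, config)[0]
    # streaming=True avoids downloading the whole dataset up front.
    stream = load_dataset(repo_id, config, split=split, streaming=True)
    return list(stream.take(n))
```

For example, `peek(repo_id_from_url(DATASET_URLS["LeMat-Traj"]), n=1)` would return a one-element list holding the first streamed record as a dictionary. Streaming is used here because of the stated scale of LeMat-Traj (over 120 million configurations); check the dataset cards for the actual configs, splits, and column names.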