Krill swarm underwater. Image: MBARI

Looking at the Bigger Picture: Results from FathomNet’s 2025 Kaggle Competition

Aug 11, 2025

By Laura Chrobak

For the past three years, FathomNet has worked with Kaggle to host machine learning (ML) competitions that advance marine computer vision. These competitions are a powerful way to engage the ML community with ocean science while surfacing key challenges that arise when applying ML in complex, data-scarce environments like the deep sea. Building on previous competitions that focused on out-of-sample detection (2023) and novel category discovery (2024), this year we challenged participants with hierarchical classification. 


The response was extraordinary: 1,508 submissions from 109 participants made this our liveliest Kaggle competition yet. With 16 active discussion threads and over 60 comments, the dialogue was thoughtful, constructive, and technically deep. We extend our gratitude to Kaggle, the CVPR-FGVC workshop organizers, and the CV4Ecology Workshop Series for their continued support. Below, we'll explore the competition details, highlight top-performing approaches, and discuss how these results can advance visual ocean monitoring.


The Challenge

Hierarchical classification, in which models structure their predictions to capture relationships across taxonomic ranks (e.g., from broad categories like families down to specific species), can significantly improve classification accuracy. In marine ecology, accurate taxonomic classification is essential for addressing fundamental questions: What species exist in a particular place? What is the ecosystem's biodiversity, and how does it change over time? Questions like these motivated the focus of our 2025 FathomNet competition, which aimed to push the boundaries of taxonomic accuracy and inspire innovative solutions in this space.


In this competition, we introduced a novel evaluation metric to better capture and reward performance on hierarchical classification tasks. This metric works by finding the nearest common ancestor between the ground truth and the prediction within a constrained taxonomic tree, and then calculating the total distance from each classification to that ancestor (i.e., the number of steps between them). The final score is the mean hierarchical distance across the entire test set. A perfect match scores 0, and the worst possible score is 12.
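
To make the scoring concrete, here is a minimal sketch of the metric in Python. The taxonomy below is a toy subtree for Asteronyx loveni, not the competition's actual 79-concept table, and the official implementation may differ in its details:

```python
# A minimal sketch of the hierarchical distance metric, assuming the
# taxonomy is stored as a child -> parent map. This toy subtree is
# illustrative only, not the competition's concept table.
PARENT = {
    "Asteronyx loveni": "Asteronyx",
    "Asteronyx": "Asteronychidae",
    "Asteronychidae": "Euryalida",
    "Euryalida": "Ophiuroidea",
    "Ophiuroidea": "Echinodermata",
    "Echinodermata": "Animalia",
}

def ancestors(taxon: str) -> list[str]:
    """Return the path from a taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def hierarchical_distance(truth: str, prediction: str) -> int:
    """Total steps from truth and prediction to their nearest common ancestor."""
    pred_depths = {t: d for d, t in enumerate(ancestors(prediction))}
    for truth_depth, taxon in enumerate(ancestors(truth)):
        if taxon in pred_depths:  # nearest common ancestor found
            return truth_depth + pred_depths[taxon]
    raise ValueError("taxa do not share a root")

print(hierarchical_distance("Asteronyx loveni", "Asteronyx loveni"))  # 0 (perfect)
print(hierarchical_distance("Asteronyx loveni", "Asteronyx"))         # 1 (right genus)
```

The final competition score is simply this distance averaged over every prediction in the test set.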


This approach better reflects real-world ecological analysis. A prediction that's close (e.g., right genus, wrong species) still carries useful ecological meaning. However, to simplify evaluation and stay within platform limits, we restricted predictions to 79 dataset concepts—something we hope to improve in future competitions.


This competition was inspired by BioCLIP, a vision-language model trained on biological data that won best student paper at CVPR 2024. The Tree of Life dataset that BioCLIP was trained on doesn't include much deep-sea imagery, and while some participants experimented with the model, success was limited. Notably, however, BioCLIP 2, released after the competition ended, is trained with FathomNet data, and we're excited to see any follow-up work exploring its use!


Example of the evaluation metric with a score of 1: the ground truth is the species Asteronyx loveni and the submission is its parent genus, Asteronyx, one step away in the taxonomic tree.
Example of the evaluation metric with a score of 6: the ground truth Asteronyx loveni and a prediction in the family Ophiacanthidae diverge at the class Ophiuroidea, their nearest common ancestor.

Winning Solutions

Several winning teams discovered that looking at the bigger picture improved their results. Instead of focusing solely on cropped images of individual organisms, or regions of interest (ROIs), they incorporated context windows that showed the surrounding marine environment, replicating how expert biologists use habitat and ecological clues to identify species more accurately.
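
As a rough illustration of the idea, the sketch below expands an annotated bounding box into a larger window before cropping, so the model sees the organism together with its surroundings. The (x, y, width, height) box format and the 3x scale factor are our assumptions, not values reported by any team:

```python
# A hedged sketch of building a context window around an ROI, assuming
# bounding boxes in (x, y, width, height) pixel coordinates.
from PIL import Image

def context_crop(image: Image.Image, box: tuple[int, int, int, int],
                 scale: float = 3.0) -> Image.Image:
    """Crop a window `scale` times the ROI size, clamped to the frame."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2                 # ROI center
    half_w, half_h = w * scale / 2, h * scale / 2
    left = max(0, int(cx - half_w))
    top = max(0, int(cy - half_h))
    right = min(image.width, int(cx + half_w))
    bottom = min(image.height, int(cy + half_h))
    return image.crop((left, top, right, bottom))
```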


The top-performing team, Yonsei+SSL, achieved an average score of 1.45 on our evaluation metric (where zero is a perfect score). Yonsei+SSL enhanced fine-grained marine classification by combining object-level and environmental context through multi-scale image inputs. Using a vision transformer, their model processed both the ROI and the environmental context simultaneously. A key innovation was their multi-context environmental attention module (MCEAM), which learned to identify meaningful relationships between organisms and their surroundings. Several other top-performing solutions, including Databaes, used this approach as well. Additionally, Yonsei+SSL incorporated a hierarchical auxiliary classifier that leveraged taxonomic relationships, helping the model learn that closely related species share similar characteristics.
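
The sketch below shows the general pattern in PyTorch, not the team's published MCEAM code: the ROI crop and the wider context crop are embedded separately, ROI tokens attend to the context tokens, and a species head is trained jointly with a coarser auxiliary head. All module names, sizes, and the loss weighting are illustrative assumptions:

```python
# A loose sketch of multi-scale input plus environmental attention and a
# hierarchical auxiliary head. Not the team's actual implementation.
import torch
import torch.nn as nn

class MultiScaleClassifier(nn.Module):
    def __init__(self, num_species: int, num_coarse: int, dim: int = 256):
        super().__init__()
        # Shared patch embedding: 16x16 patches -> `dim`-dimensional tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # ROI tokens query the environmental-context tokens.
        self.env_attention = nn.MultiheadAttention(dim, num_heads=8,
                                                   batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.species_head = nn.Linear(dim, num_species)
        self.coarse_head = nn.Linear(dim, num_coarse)  # auxiliary rank

    def tokens(self, image: torch.Tensor) -> torch.Tensor:
        return self.patch_embed(image).flatten(2).transpose(1, 2)

    def forward(self, roi: torch.Tensor, context: torch.Tensor):
        roi_tok, ctx_tok = self.tokens(roi), self.tokens(context)
        attended, _ = self.env_attention(roi_tok, ctx_tok, ctx_tok)
        pooled = self.norm(roi_tok + attended).mean(dim=1)
        return self.species_head(pooled), self.coarse_head(pooled)

# Joint loss over fine and coarse labels (the 0.5 weight is an assumption).
model = MultiScaleClassifier(num_species=79, num_coarse=20)
roi, context = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
fine_logits, coarse_logits = model(roi, context)
fine_y, coarse_y = torch.randint(0, 79, (2,)), torch.randint(0, 20, (2,))
loss = (nn.functional.cross_entropy(fine_logits, fine_y)
        + 0.5 * nn.functional.cross_entropy(coarse_logits, coarse_y))
```

The auxiliary head supplies the taxonomic signal: even when the species logits are wrong, the coarse loss still pulls the representation toward the correct branch of the tree.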


The winning team, Yonsei+SSL, with their FathomNet competition certificate.

This multi-layered approach mirrors how marine biologists naturally think about species identification—considering the animal, its habitat, and its place in the tree of life. “Notably, we extended beyond simply analyzing cropped organisms by modeling the relationship between species and their environment, and incorporating hierarchical taxonomy into our model. We hope our approach contributes to future research in fine-grained marine species classification,” said Team Leader Donghwan Lee. You can read more about their approach at the links provided below. 


  1. Yonsei+SSL

  2. Databaes

  3. Dal

  4. 911

  5. Platypus Bear


Takeaways and Future Directions

This year's competition surfaced some valuable lessons, not only about how well different approaches worked, but also about the nature of real-world datasets and about designing meaningful evaluation metrics to best interpret results. Two key takeaways emerged: first, the importance of thoroughly analyzing and understanding the dataset before tackling a problem; second, the need for more flexible evaluation metrics that reflect the full biological taxonomic hierarchy, especially when species-level predictions are not always possible.


Dataset design and communication: During the competition, participants uncovered a significant distribution shift between the training and test data: the test ROIs were, on average, smaller than those in the training set, which surprised many. In hindsight, we should have made this clearer in the challenge description. With real-world datasets, especially those requiring hundreds of expert hours and involving rare species, there is often limited ability to rebalance distributions because the datasets are relatively small to begin with. Still, the episode surfaced an important lesson: ecological data is messy, and real-world applications require robust models that can handle imperfect, uneven data, along with thorough data analysis to identify where biases may be coming from.
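
A simple audit along these lines can catch such a shift early. The sketch below compares ROI area distributions between splits with a two-sample Kolmogorov-Smirnov test; the COCO-style annotation format with (x, y, width, height) boxes is our assumption:

```python
# A small sketch of a train/test ROI-size audit, assuming generic
# COCO-style annotation dicts with a "bbox" of (x, y, width, height).
import numpy as np
from scipy.stats import ks_2samp

def roi_areas(annotations: list[dict]) -> np.ndarray:
    """Bounding-box areas in pixels."""
    return np.array([a["bbox"][2] * a["bbox"][3] for a in annotations])

def report_shift(train_anns: list[dict], test_anns: list[dict]) -> None:
    train, test = roi_areas(train_anns), roi_areas(test_anns)
    stat, p = ks_2samp(train, test)  # two-sample Kolmogorov-Smirnov test
    print(f"median area: train={np.median(train):.0f}px^2, "
          f"test={np.median(test):.0f}px^2, KS={stat:.3f} (p={p:.2g})")
```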


Need for better metrics: Many participants asked for evaluation systems that allow predictions at any taxonomic rank when species-level certainty isn't possible. We agree: future versions of the competition should support predictions across the full taxonomic hierarchy. As implemented, the current metric requires each submission to correspond to one of the original 79 categories. Ideally, we would have allowed predictions for any taxon searchable via the FathomNet Python API, but for the sake of simplicity and speed we built an evaluation that cross-references a fixed map of those 79 concepts, keeping the problem scope manageable in the evaluation code. Developing useful and fair evaluation methods for hierarchical classifiers remains an active area of research; indeed, one of our hopes for this competition is that it highlights the need for further work in this space.


Next Steps

We plan to release the test set publicly to support continued benchmarking and research. Stay tuned next year for FathomNet’s participation in CLEF 2026 as part of LifeCLEF. If you or your organization is interested in sponsoring prizes for future ocean AI challenges, please reach out. We’d love to partner.



Contact us: fathomnet@mbari.org
