Bellu Ai

A Deterministic, Deficit-Aware Framework for Egocentric Data Collection

Published: January 10, 2026

Abstract

Egocentric data collection systems—such as human-in-the-loop task recording, embodied AI data pipelines, and personalized task execution platforms—must balance three competing objectives: (i) coverage of all required task–attribute combinations, (ii) urgency arising from under-collected data, and (iii) personalization to user context.

Most existing task selection or recommendation approaches prioritize either personalization or popularity, but rarely incorporate explicit coverage constraints or deficit-aware prioritization.

In this article, we present a deterministic, fully vectorized framework for egocentric data collection that integrates coverage targets, continuous deficit-aware boosting, and strong-match personalization into a single probability distribution. The method is expressed entirely in linear-algebraic form, making it efficient, interpretable, and production-ready. While the framework is heuristic rather than theoretically optimal, it provides a practical foundation for controlled, adaptive data collection in real-world systems.

1. Introduction

Egocentric data collection refers to systems in which tasks are performed, recorded, or executed by users or agents under varying contexts (e.g., time of day, environment, device state, user preference). Examples include:

  • Human demonstration collection for robotics
  • Task recording for personal assistants
  • Contextual data gathering for embodied AI systems

A core challenge in such systems is task selection: given many possible task–attribute combinations, which one should be executed or recorded next?

Naive approaches—uniform sampling or popularity-based recommendation—fail in practice. They either:

  • Over-sample already well-covered combinations, or
  • Ignore personalization, leading to poor user alignment.

This work proposes a deterministic task-selection framework that explicitly encodes:

  • Target coverage constraints
  • Deficit-driven urgency
  • User-context alignment

2. Problem Formulation

Consider a task with \( m \) distinct attribute combinations (e.g., task × lighting × time-of-day). Each combination represents a unit of data we wish to collect.

We define the following vectors in \( \mathbb{R}^m \):

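| Symbol | Meaning |
|---|---|
| \( \mathbf{s} \) | User-specified slider weights |
| \( \mathbf{A} \) | Collected hours so far |
| \( \mathbf{H} \) | Target hours per combination |
| \( \mathbf{S} \) | Remaining (missing) hours |
| \( \mathbf{R} \) | Coverage ratio |
| \( \mathbf{p}_{\text{rec}} \) | Final recommendation probabilities |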

We also define:

  • \( M \in \mathbb{R}^{m \times k} \): attribute matrix (one row per combination)
  • \( U \in \mathbb{R}^{k} \): user context vector

The goal is to compute a probability distribution over combinations that balances coverage, urgency, and personalization.

3. Method Overview

The algorithm proceeds in four conceptual stages:

  1. Normalize user intent (sliders)
  2. Enforce coverage targets
  3. Apply deficit-aware boosting
  4. Personalize via attribute similarity

All steps are vectorized and deterministic.

4. Slider Normalization

User input is first normalized into a probability distribution:

\[ \mathbf{p} = \frac{\mathbf{s}}{\mathbf{1}^\top \mathbf{s}} \]

This represents the desired relative importance of each combination before considering coverage or urgency.
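
The normalization above can be sketched in NumPy as follows. This is a minimal illustration, not the article's implementation: the function name `normalize_sliders` and the error handling for negative or all-zero sliders are assumptions, since the article does not say how such inputs are treated.

```python
import numpy as np

def normalize_sliders(s):
    """Turn non-negative slider weights s into p = s / (1^T s)."""
    s = np.asarray(s, dtype=float)
    total = s.sum()
    if np.any(s < 0) or total <= 0:
        # Assumption: the article does not specify this case, so we reject it.
        raise ValueError("sliders must be non-negative with a positive sum")
    return s / total

# Three combinations; the first is weighted twice as heavily as the others.
p = normalize_sliders([2.0, 1.0, 1.0])
# p holds the values 0.5, 0.25, 0.25 and sums to 1
```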

5. Target Hour Allocation

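Given a total budget \( H_{\text{total}} \), we compute raw targets:

\[ \mathbf{H}^{(0)} = H_{\text{total}} \cdot \mathbf{p} \]

To avoid starvation, we apply a minimum floor \( \varepsilon_{\text{floor}} \) elementwise:

\[ \mathbf{H}^{(1)} = \max\!\left(\mathbf{H}^{(0)},\ \varepsilon_{\text{floor}}\right) \]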

Finally, we renormalize to preserve the total:

\[ \mathbf{H} = \mathbf{H}^{(1)} \cdot \frac{H_{\text{total}}}{\mathbf{1}^\top \mathbf{H}^{(1)}} \]
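
A small NumPy sketch of the three allocation steps may help make them concrete. The function name `allocate_targets` and the example numbers are illustrative assumptions, not values from the article.

```python
import numpy as np

def allocate_targets(p, h_total, eps_floor):
    """Raw targets, elementwise floor, then rescale to sum to h_total."""
    p = np.asarray(p, dtype=float)
    h0 = h_total * p                    # raw targets H^(0)
    h1 = np.maximum(h0, eps_floor)      # floored targets H^(1)
    return h1 * (h_total / h1.sum())    # final targets H, summing to h_total

# The third combination has zero slider weight but still gets a target.
H = allocate_targets([0.7, 0.3, 0.0], h_total=100.0, eps_floor=5.0)
# Floored targets are 70, 30, 5 (sum 105), so every entry is scaled by 100/105.
```

One consequence worth noting: whenever the floor raises any entry, the floored vector sums to more than \( H_{\text{total}} \), so the rescaling factor is below 1 and the floored entries end up slightly under \( \varepsilon_{\text{floor}} \) (here 500/105 ≈ 4.76 hours instead of 5).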

6. Deficit and Coverage

Remaining hours:

\[ \mathbf{S} = \mathbf{H} - \mathbf{A} \]

Coverage ratio:

\[ \mathbf{R} = \mathbf{A} \oslash \mathbf{H} \quad \text{(elementwise division)} \]

We define a mask for incomplete combinations:

\[ \mathbf{m} = \mathbb{1}\left[\mathbf{S} > 0\right] \]
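
The three quantities in this section can be computed together. The sketch below is illustrative only: the function name `coverage_state` is an assumption, and it relies on every target being strictly positive, which the floor in Section 5 provides when \( \varepsilon_{\text{floor}} > 0 \).

```python
import numpy as np

def coverage_state(A, H):
    """Return remaining hours S, coverage ratio R, and incomplete mask m."""
    A = np.asarray(A, dtype=float)
    H = np.asarray(H, dtype=float)
    S = H - A                    # remaining hours (negative if over-collected)
    R = A / H                    # coverage ratio; assumes every H_j > 0
    m = (S > 0).astype(float)    # 1.0 where hours are still missing, else 0.0
    return S, R, m

S, R, m = coverage_state(A=[10.0, 30.0, 5.0], H=[40.0, 30.0, 4.0])
# S is 30, 0, -1; R is 0.25, 1.0, 1.25; m is 1, 0, 0
```

Only the first combination is still incomplete in this example; the second has exactly met its target and the third is over-collected, so both are masked out.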

7. Continuous Deficit-Aware Boosting

Rather than binary prioritization, we apply continuous boosting based on how far coverage falls below a threshold \( R_{\text{threshold}} \).

Shortfall:

\[ \mathbf{x} = \max\!\left(\mathbf{0},\ R_{\text{threshold}} \cdot \mathbf{1} - \mathbf{R}\right) \]

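Boost multiplier, where \( \beta \) controls the boost strength and the mask \( \mathbf{m} \) zeroes out completed combinations:

\[ \mathbf{b} = \left(\mathbf{1} + (\beta - 1)\,\mathbf{x}\right) \odot \mathbf{m} \]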

Boosted distribution:

\[ \mathbf{p}_{\text{boost}} = \frac{\mathbf{p} \odot \mathbf{b}}{\mathbf{1}^\top (\mathbf{p} \odot \mathbf{b})} \]

This ensures under-covered combinations receive smoothly increasing priority.
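
The boosting step can be sketched as below. The function name `boost`, the parameter values, and the explicit error when every combination is masked out are assumptions made for illustration; the article does not state how a zero denominator is handled.

```python
import numpy as np

def boost(p, R, m, r_threshold, beta):
    """Apply the continuous deficit-aware boost and renormalize."""
    p, R, m = (np.asarray(v, dtype=float) for v in (p, R, m))
    x = np.maximum(0.0, r_threshold - R)    # shortfall below the threshold
    b = (1.0 + (beta - 1.0) * x) * m        # multiplier; 0 for complete combos
    weighted = p * b
    total = weighted.sum()
    if total == 0.0:
        # Assumption: nothing left to boost, e.g. all combinations complete.
        raise ValueError("no incomplete combination with positive weight")
    return weighted / total

p_boost = boost(p=[0.5, 0.25, 0.25], R=[0.25, 0.9, 1.0], m=[1, 1, 0],
                r_threshold=0.75, beta=3.0)
# Shortfall is 0.5, 0, 0, so the multipliers are 2, 1, 0
# and p_boost holds 0.8, 0.2, 0.0
```

In this example the first combination sits well below the threshold and its share rises from 0.5 to 0.8, while the completed third combination drops to zero.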

8. Urgency Weighting

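Urgency is modeled as a power law of the remaining deficit, with exponent \( \gamma \) applied elementwise and a floor \( \varepsilon \):

\[ \mathbf{u} = \max\!\left(\varepsilon,\ \mathbf{S}^{\odot \gamma}\right) \]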

Final suggestion distribution:

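\[ \mathbf{p}_{\text{suggest}} = \frac{\mathbf{u} \odot \mathbf{p}_{\text{boost}}}{\mathbf{1}^\top (\mathbf{u} \odot \mathbf{p}_{\text{boost}})} \]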
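
A possible NumPy rendering of the urgency step is shown below. Two details are assumptions rather than part of the article: negative deficits (over-collected combinations) are clipped to zero before the power is taken, so that fractional exponents stay real-valued, and the default `eps=1e-9` is an arbitrary small constant. The function name `apply_urgency` is likewise illustrative.

```python
import numpy as np

def apply_urgency(p_boost, S, gamma, eps=1e-9):
    """Weight the boosted distribution by a power law of the deficit."""
    p_boost = np.asarray(p_boost, dtype=float)
    S = np.asarray(S, dtype=float)
    # Assumption: clip negative deficits to 0 so S ** gamma is well defined.
    u = np.maximum(eps, np.clip(S, 0.0, None) ** gamma)
    weighted = u * p_boost
    return weighted / weighted.sum()

p_suggest = apply_urgency(p_boost=[0.8, 0.2, 0.0], S=[4.0, 16.0, -1.0],
                          gamma=0.5)
# u is 2, 4, eps, so the weighted values are 1.6, 0.8, 0
# and p_suggest holds 2/3, 1/3, 0
```

With \( \gamma = 0.5 \) the second combination, which has four times the deficit of the first, receives only twice the urgency weight; values of \( \gamma \) above 1 would instead amplify large deficits.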

9. Strong-Match Personalization

9.1 Attribute Similarity

We compute the cosine similarity between each combination (row \( M_j \) of \( M \)) and the user context:

\[ c_j = \frac{M_j U}{\lVert M_j \rVert \, \lVert U \rVert}, \quad j = 1, \dots, m \]

9.2 Match Emphasis

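To emphasize strong matches, the similarities are raised elementwise to a power \( \delta \):

\[ \mathbf{w} = \mathbf{c}^{\odot \delta} \]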

9.3 Final Recommendation Distribution

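\[ \mathbf{p}_{\text{rec}} = \frac{\mathbf{w} \odot \mathbf{p}_{\text{suggest}}}{\mathbf{1}^\top (\mathbf{w} \odot \mathbf{p}_{\text{suggest}})} \]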

This is the final distribution used for task selection.
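
The personalization stage can be sketched as follows. The function name `personalize` and the toy attribute matrix are illustrative assumptions. The sketch also clips negative cosine similarities to zero before exponentiation, which the article does not specify; without some such rule, negative similarities could produce negative or undefined weights.

```python
import numpy as np

def personalize(p_suggest, M, U, delta):
    """Reweight p_suggest by cosine similarity raised to the power delta."""
    p_suggest = np.asarray(p_suggest, dtype=float)
    M = np.asarray(M, dtype=float)
    U = np.asarray(U, dtype=float)
    # Cosine similarity between every row M_j and the context vector U.
    c = (M @ U) / (np.linalg.norm(M, axis=1) * np.linalg.norm(U))
    # Assumption: negative similarities are treated as "no match" (weight 0).
    w = np.clip(c, 0.0, None) ** delta
    weighted = w * p_suggest
    return weighted / weighted.sum()

M = [[1.0, 0.0],   # matches the context exactly
     [0.0, 1.0],   # orthogonal to the context
     [1.0, 1.0]]   # partial match
U = [1.0, 0.0]
p_rec = personalize([0.5, 0.25, 0.25], M, U, delta=2.0)
# Similarities are 1, 0, 1/sqrt(2); weights are 1, 0, 0.5
# and p_rec holds 0.8, 0.0, 0.2
```

Note that a combination with zero similarity receives zero probability regardless of its deficit, so in practice the choice of \( \delta \) and of the attribute encoding determines whether every incomplete combination remains reachable.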

10. Update Rule

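When a combination \( j \) is selected and \( \Delta \) hours of data are recorded, the collected-hours vector is updated, with \( \mathbf{e}_j \) the \( j \)-th standard basis vector:

\[ \mathbf{A} \leftarrow \mathbf{A} + \Delta \cdot \mathbf{e}_j \]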

The process repeats until every target is met:

\[ \mathbf{A} \ge \mathbf{H} \quad \text{(elementwise)} \]
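
To show how the update rule drives the whole pipeline, here is a compact end-to-end sketch under several stated assumptions: the function names `recommend` and `collect`, all default parameter values, the clipping of negative deficits and similarities, and the `max_steps` guard are illustrative and not taken from the article. The article also does not say how a combination is drawn from \( \mathbf{p}_{\text{rec}} \); the sketch picks the most probable one (argmax) to keep the run deterministic. Because each intermediate normalization only divides by a positive scalar, normalizing the product \( \mathbf{p} \odot \mathbf{b} \odot \mathbf{u} \odot \mathbf{w} \) once gives the same \( \mathbf{p}_{\text{rec}} \).

```python
import numpy as np

def recommend(s, A, M, U, h_total, eps_floor=1.0, r_threshold=0.8,
              beta=3.0, gamma=1.0, delta=2.0, eps=1e-9):
    """Return (p_rec, H); p_rec is None when nothing can be recommended."""
    s, A, M, U = (np.asarray(v, dtype=float) for v in (s, A, M, U))
    p = s / s.sum()                                  # slider normalization
    H1 = np.maximum(h_total * p, eps_floor)          # floored targets
    H = H1 * (h_total / H1.sum())                    # rescaled targets
    S = H - A                                        # remaining hours
    R = A / H                                        # coverage ratio
    m = (S > 0).astype(float)                        # incomplete mask
    b = (1.0 + (beta - 1.0) * np.maximum(0.0, r_threshold - R)) * m
    u = np.maximum(eps, np.clip(S, 0.0, None) ** gamma)
    c = (M @ U) / (np.linalg.norm(M, axis=1) * np.linalg.norm(U))
    w = np.clip(c, 0.0, None) ** delta
    score = p * b * u * w                            # unnormalized p_rec
    total = score.sum()
    return (score / total if total > 0 else None), H

def collect(s, M, U, h_total, step_hours=1.0, max_steps=10_000):
    """Repeat A <- A + step_hours * e_j until no combination is recommended."""
    A = np.zeros(len(s))
    H = None
    order = []
    for _ in range(max_steps):                       # guard against stalls
        p_rec, H = recommend(s, A, M, U, h_total)
        if p_rec is None:                            # all targets met
            break
        j = int(np.argmax(p_rec))                    # assumption: greedy pick
        A[j] += step_hours                           # update rule
        order.append(j)
    return A, H, order

A, H, order = collect(s=[2.0, 1.0, 1.0],
                      M=[[1.0, 0.0], [1.0, 1.0], [0.5, 1.0]],
                      U=[1.0, 1.0], h_total=10.0)
# Targets H are 5, 2.5, 2.5; with one-hour steps the run ends at A = 5, 3, 3.
```

The loop stops when the masked score is zero everywhere, which coincides with the stopping condition above as long as every incomplete combination has positive slider weight and positive similarity to the user context. If an incomplete combination has zero weight, the sketch stops early rather than looping forever.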

11. Discussion

Strengths

  • Explicit coverage targets with a minimum floor per combination
  • Smooth urgency prioritization
  • Interpretable personalization
  • Fully vectorized and scalable

Limitations

  • Heuristic parameter choices
  • No theoretical optimality guarantee
  • Requires empirical tuning
  • No large-scale experimental validation yet

12. Future Work

  • Simulation-based evaluation
  • Comparison with bandit and recommender baselines
  • Theoretical analysis of convergence
  • Extension to multi-user global optimization

13. Conclusion

This article presented a deterministic, deficit-aware framework for egocentric data collection that integrates coverage constraints, urgency, and personalization into a single linear-algebraic formulation. While not positioned as a theoretical breakthrough, the method offers a practical and extensible foundation for real-world data collection systems where balance—not pure personalization—is critical.

Author Note: This work is presented as a technical working article. Feedback, critique, and collaboration are welcome.