

Session

Research-Track Oral Presentation: R17: Edge and Mobile

Grand Ballroom 2
Thu 21 May 1 p.m. PDT — 2:30 p.m. PDT

Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit, but these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a distributed decision problem between orbit and ground. EarthSight introduces three core innovations: (1) multi-task inference on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a ground-station query scheduler that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) dynamic filter ordering, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. By combining global context from ground stations with resource-aware adaptive decisions in orbit, EarthSight enables constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. Evaluations on an established satellite simulator show that EarthSight reduces average compute time per image by 1.9× and lowers 90th-percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared with the state-of-the-art baseline.
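The "dynamic filter ordering" idea maps onto classic selectivity-aware predicate ordering. Below is a minimal, hypothetical sketch in that spirit: the filter names, costs, and pass rates are invented, and the real system also folds model accuracy into the ranking.

```python
# Hypothetical sketch of cost/selectivity-aware filter ordering, in the
# spirit of EarthSight's dynamic filter ordering. Filters and numbers are
# illustrative only.

def order_filters(filters):
    """Run cheap, highly selective filters first.

    Each filter is (name, cost_ms, pass_rate), where pass_rate is the
    fraction of images the filter lets through. Ranking by
    cost / (1 - pass_rate) minimizes expected work per rejected image.
    """
    return sorted(filters, key=lambda f: f[1] / (1.0 - f[2]))

def expected_cost(ordered):
    """Expected per-image compute: a filter runs only if all earlier ones passed."""
    total, p_reach = 0.0, 1.0
    for _, cost_ms, pass_rate in ordered:
        total += p_reach * cost_ms
        p_reach *= pass_rate
    return total

filters = [
    ("cloud_mask",   2.0, 0.5),   # cheap, rejects half the images
    ("land_cover",   8.0, 0.8),
    ("object_count", 30.0, 0.9),  # expensive, rejects few
]
```

Under these toy numbers, running `cloud_mask` first cuts expected per-image cost by more than half versus the reverse order, which is the same lever EarthSight pulls to conserve onboard compute.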


IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

Wanli Zhong ⋅ Haibo Feng ⋅ Zirui Zhou ⋅ Hanyang Peng ⋅ Shiqi Yu

Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly dequantize → softmax → requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer, plug-and-play attention pipeline that requires no retraining. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces the floating-point exponential with computation performed entirely in the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype-conversion overhead. We evaluate IntAttention and demonstrate consistent, substantial gains: our method achieves up to 3.7× speedup and 61% energy reduction over FP16 baselines, and runs 2.0× faster than conventional INT8 attention pipelines on Armv8 CPUs. These gains come with accuracy comparable to the baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices.
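A lookup-table softmax of this kind can be sketched in a few lines. The sketch below is hypothetical: the 32-entry table matches the abstract, but the clipping range, fixed-point scale, and index mapping are invented rather than taken from the paper.

```python
import numpy as np

# Hypothetical sketch of an IndexSoftmax-style integer softmax: a 32-entry
# lookup table stands in for exp(), and per-row arithmetic stays in the
# integer domain. Constants below are illustrative guesses.

CLIP = 8.0                 # scores more than CLIP below the row max are ~0
SCALE = 2 ** 15            # fixed-point scale of the output probabilities
# LUT[i] ~= exp(-CLIP * i / 31), built once offline in fixed point
LUT = np.round(SCALE * np.exp(-CLIP * np.arange(32) / 31)).astype(np.int64)

def index_softmax(scores_q, score_scale):
    """scores_q: int32 attention-score row; score_scale: dequantization step.

    Returns fixed-point probabilities summing to (almost exactly) SCALE.
    Only the table-index step is derived from score_scale (precomputable);
    the per-row work is subtraction, table lookup, and integer division.
    """
    step_q = max(1, round(CLIP / (31 * score_scale)))
    d = scores_q.max() - scores_q.astype(np.int64)   # distance from row max
    idx = np.minimum(d // step_q, 31)                # sparsity-aware clip
    num = LUT[idx]
    return (num * SCALE) // num.sum()                # integer normalization
```

Comparing the fixed-point output against a float softmax on the dequantized scores shows the approximation tracks the reference closely while never leaving integer arithmetic at inference time.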


LEANN: A Low-Storage Overhead Vector Index

Yichuan Wang ⋅ Zhifei Li ⋅ Shu Liu ⋅ Yongji Wu ⋅ Ziming Mao ⋅ Yilong Zhao ⋅ Xiao Yan ⋅ Zhiying Xu ⋅ Yang Zhou ⋅ Ion Stoica ⋅ Sewon Min ⋅ Matei Zaharia ⋅ Joseph Gonzalez

Embedding-based vector search underpins many important applications, such as recommendation and retrieval-augmented generation (RAG), and relies on vector indices for efficient search. However, these indices require storing high-dimensional embeddings and large index metadata, whose total size can be several times larger than the original data (e.g., text chunks). Such high storage overhead makes it difficult, or even impractical, to deploy vector search on personal devices or over large-scale datasets. To tackle this problem, we propose LEANN, a storage-efficient index for vector search that recomputes embeddings on the fly instead of storing them, and that compresses state-of-the-art proximity-graph indices while preserving search accuracy. LEANN delivers high-quality vector search while using only a fraction of the storage (e.g., 5% of the original data) and supports low-overhead index construction and updates. On real-world benchmarks, LEANN reduces index size by up to 50× compared with conventional indices, while maintaining state-of-the-art accuracy and comparable latency for RAG applications.
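The central trade of storage for compute can be illustrated with a toy graph search in which node embeddings are recomputed on demand rather than read from a stored index. Everything below is a stand-in: `embed` is a deterministic toy embedding rather than a neural encoder, and the graph is hand-built rather than a compressed proximity graph.

```python
import heapq
import math

# Toy sketch of LEANN's core idea: keep only the graph and raw text, and
# recompute embeddings during search instead of storing them on disk.

def embed(text, dim=8):
    # Deterministic toy embedding; a real system runs the embedding model here.
    vals = [hash((text, i)) % 1000 / 1000.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vals)) or 1.0
    return [v / norm for v in vals]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def graph_search(graph, texts, query_vec, entry, k=2):
    """Best-first search over a proximity graph; embeddings are recomputed,
    never stored, with a per-query cache to avoid repeat work."""
    cache = {}
    def node_dist(n):
        if n not in cache:
            cache[n] = dist(query_vec, embed(texts[n]))
        return cache[n]

    visited = {entry}
    frontier = [(node_dist(entry), entry)]
    best = []                                  # max-heap of k nearest so far
    while frontier:
        d, n = heapq.heappop(frontier)
        heapq.heappush(best, (-d, n))
        if len(best) > k:
            heapq.heappop(best)
        for nb in graph[n]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (node_dist(nb), nb))
    return sorted((-d, n) for d, n in best)    # (distance, node) ascending
```

The recompute cost is paid per query, which is why pairing this layout with a pruned, high-quality graph matters: fewer visited nodes means fewer embedding calls.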


Reparo: Loss-Resilient Generative Codec for Video Conferencing

Tianhong Li ⋅ Vibhaalakshmi Sivaraman ⋅ Pantea Karimi ⋅ Lijie Fan ⋅ Mohammad Alizadeh ⋅ Dina Katabi

Packet loss during video conferencing often results in poor quality and video freezes. Retransmitting lost packets is impractical given the need for real-time playback, and using Forward Error Correction (FEC) for packet recovery is challenging due to the unpredictable and bursty nature of Internet losses: excessive redundancy wastes bandwidth, while insufficient redundancy leaves frames undecodable, causing video freezes and quality degradation in subsequent frames. We introduce Reparo, a loss-resilient video conferencing framework that uses generative deep learning models to address these issues. When a frame or part of a frame is lost, our approach generates the missing information conditioned on the data received thus far and on the model's understanding of how people and objects appear and interact in the visual world. Experimental results on publicly available video conferencing datasets show that Reparo outperforms state-of-the-art FEC-based video conferencing solutions in both video quality (measured through PSNR, SSIM, and LPIPS) and the occurrence of video freezes.


Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling with CORE

Zongpu Zhang ⋅ Y. Charlie Hu ⋅ Qiang Xu ⋅ Jian Li ⋅ Haibing Guan

Despite the rapid adoption of large language models (LLMs) in mobile applications, deploying them efficiently on resource-constrained devices remains challenging due to compute, memory, and energy constraints. In this paper, we first evaluate the energy efficiency of state-of-the-art mobile LLM frameworks across multiple models and uncover a key inefficiency: the default governors make independent decisions, which can result in 23.0–40.4% longer latency or 5.0–16.6% higher energy use compared to optimal frequency combinations. We then conduct an in-depth analysis to reveal the root cause: the lack of cross-resource coordination among these governors during prefilling and decoding. Building on these findings, we present CORE, a unified, energy-aware governor that jointly coordinates CPU, GPU, and memory frequencies for mobile LLM inference. Experiments across diverse LLMs show that CORE reduces time-to-first-token by 7.0–16.9% and time-per-token by 25.4–36.8% on average, without increasing energy per token.
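The coordination gap CORE targets can be illustrated with a toy joint search over (CPU, GPU, memory) frequency combinations. The latency and power models below are invented placeholders for on-device profiles, and a real governor would run such a selection per phase (prefill vs. decode) rather than once.

```python
from itertools import product

# Toy sketch of coordinated DVFS in the spirit of CORE: search all
# (cpu, gpu, mem) frequency combinations jointly, rather than letting
# per-resource governors tune each knob independently and miss the
# joint optimum. Models are illustrative placeholders.

def pick_frequencies(freqs, latency_model, power_model, latency_budget):
    """Return the (cpu, gpu, mem) combo with minimum energy (power x latency)
    among combos meeting the latency budget, or (None, inf) if none do."""
    best, best_energy = None, float("inf")
    for combo in product(freqs["cpu"], freqs["gpu"], freqs["mem"]):
        lat = latency_model(*combo)
        if lat > latency_budget:
            continue
        energy = power_model(*combo) * lat
        if energy < best_energy:
            best, best_energy = combo, energy
    return best, best_energy
```

Even in this toy setting, the energy-optimal point is not the maximum of every knob: lowering the memory frequency while keeping CPU and GPU high can meet the deadline at lower total energy, which per-resource governors cannot discover on their own.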