• October 2 - 6, 2023
  • PARIS, FR

ICCV 2023

2023 International Conference on Computer Vision (ICCV)

ICCV is the premier international computer vision event comprising the main conference and several co-located workshops and tutorials.

We look forward to this year's exciting sponsorship and exhibition opportunities, featuring a variety of ways to connect with participants in person. Sony will exhibit and participate as a Silver sponsor.

Recruiting information for ICCV 2023

We look forward to working with highly motivated individuals to fill the world with emotion and to pioneer future innovation through dreams and curiosity. With us, you will be welcomed onto diverse, innovative, and creative teams that set out to inspire the world.

The full-time and internship roles previously listed on this page are now closed.
Please see all of our other open positions through the links below.

Sony AI: https://ai.sony/joinus/jobroles/
Global Careers Page: https://www.sony.com/en/SonyInfo/Careers/japan/

NOTE: For those interested in Japan-based full-time and internship opportunities, the following points and benefits apply:

・Japanese language skills are NOT required, as your work will be conducted in English.
・Japan-based internships are paid, and we additionally cover round-trip flights, visa expenses, commuting expenses, and accommodation as part of our support package.
・For Japan-based full-time roles, in addition to your compensation and benefits package, we cover your flight to Japan, shipment of your belongings, visa expenses, commuting expenses, and more!


Technologies & Business Use Cases

Technology 01 Federated Learning and Vision Foundation Model Development
Traditional machine learning requires centralizing large amounts of data from diverse sources on a single server. However, growing concern over data privacy, particularly for applications involving sensitive personal information, makes this training paradigm increasingly problematic. Federated learning (FL) revolutionizes centralized training by enabling models to be trained on decentralized data without any of it being shared with a central server. Meet the Sony AI Privacy and Security Team to learn how we have built extensive experience in FL research and extended it to computer vision applications and vision foundation model development. We have published numerous related papers in top-tier AI conferences and journals (e.g., NeurIPS, ICLR, ICML, and Nature Communications).
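To make the decentralized training loop concrete, here is a minimal sketch of federated averaging (FedAvg), the canonical FL algorithm, on a toy linear-regression problem; the model, data, and hyperparameters are illustrative placeholders rather than our actual setup.

```python
import numpy as np

# Minimal FedAvg sketch: each client trains locally on its private data;
# only model parameters (never raw data) are sent to the server, which
# aggregates them into the next global model.

def gradient(w, x, y):
    # Illustrative least-squares gradient for a linear model.
    return (x @ w - y) * x

def local_update(weights, client_data, lr=0.1, epochs=1):
    """One client's local training pass over its private dataset."""
    w = weights.copy()
    for _ in range(epochs):
        for x, y in client_data:
            w -= lr * gradient(w, x, y)
    return w

def fedavg_round(global_weights, clients):
    """Server aggregates client models, weighted by local dataset size."""
    updates, sizes = [], []
    for data in clients:
        updates.append(local_update(global_weights, data))
        sizes.append(len(data))
    sizes = np.array(sizes, dtype=float)
    return np.average(updates, axis=0, weights=sizes / sizes.sum())

# Toy run: 3 clients, each holding private (x, y) pairs for a 2-D linear model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = [[(x, x @ true_w) for x in rng.normal(size=(20, 2))] for _ in range(3)]
w = np.zeros(2)
for _ in range(30):
    w = fedavg_round(w, clients)
print(w)  # approaches true_w without any client ever sharing raw data
```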

We are actively recruiting (senior) research scientists and research engineers with expertise in vision foundation models and computer vision applications to join us. Please send your resume to lingjuan.lv@sony.com or weiming.zhuang@sony.com if you are interested.

You can apply for an internship with us via https://ai.sony/joinus/job-roles/research-intern-privacy-preserving-machine-learning/.

Technology 02 Deep Generative Models for Music and Content Creators

At Sony Research, we believe that deep learning models will power the majority of the tools and software used by audio professionals in the future. Advances in AI in this direction will help creators achieve goals that would be unfeasible with today's technology, pushing the boundaries of creation in a radical way.
Understanding how audio and media professionals can benefit from the new possibilities offered by AI is at the core of our mission, and our work is driven by our commitment to solving real-world problems. Our research spans a wide range of topics, from foundational deep learning techniques to more application-oriented work. By working closely with content creators and world-leading entertainment groups, we aim to create human-centric technology that allows them to fulfill their wildest artistic dreams.
Meet the Sony AI Music Foundation Model team to see how we are building the next generation of AI tools in collaboration with artists and professionals.
A list of our publications is available at https://sony.github.io/creativeai/.

Technology 03 Image Recognition Technologies in Sony R&D

Image recognition is one of the most successful AI technologies commercialized in Sony products, and we have been developing it for more than 20 years. Starting with AIBO in 1999, we have built functions such as face recognition, human recognition, and object recognition. Our AI models are lightweight and run on edge devices in real time. The talk covers a brief history of our image recognition technologies and the products and services they shipped in, along with our current research and development topics.


Technology 04 Deep Generative Modeling

Technologies like deep generative models (DGMs) have the potential to transform the lives of consumers and creators. Sony R&D is developing large-scale DGM technologies for content generation and restoration, which we call Sony DGM. We expect Sony DGM to become an integral part of the music, film, and gaming industries in the years to come, and since we at Sony R&D have the unique opportunity to work directly with world-leading entertainment groups in these industries, we want to make the most of it. Sony DGM currently comprises two categories: diffusion-based models and a stochastic vector quantization technique, both briefly introduced below. Demonstrations of image generation and media restoration are available at https://sony.github.io/creativeai

GibbsDDRM

Accepted at ICML 2023 as an oral presentation
TL;DR: Solving blind linear inverse problems by using pre-trained diffusion models in a Gibbs sampling manner.
Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from a noisy linear measurement. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to the blind setting, where the linear measurement operator is unknown. It constructs the joint distribution of the data, measurements, and linear operator using a pre-trained diffusion model as the data prior, and solves the problem by posterior sampling with an efficient variant of a Gibbs sampler. The method is problem-agnostic: a pre-trained diffusion model can be applied to various inverse problems without fine-tuning. Experimentally, it achieves high performance in both blind image deblurring and vocal dereverberation (*) tasks, despite using simple generic priors for the underlying linear operators. We expect this technology to be used for content editing in music and film production.

(*) We have confirmed that GibbsDDRM improves the performance of DiffDereverb, presented in our ICASSP 2023 paper "Unsupervised vocal dereverberation with diffusion-based generative models", thanks to its novel sampling scheme.
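Schematically, the sampler alternates between two conditional updates. The following sketch only marks where the real components would go; the function bodies are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

# Pseudocode-level sketch of a GibbsDDRM-style sampler: alternate
# (a) a reverse-diffusion step on the data x, conditioned on the current
# operator estimate, with (b) a posterior update of the operator
# parameters phi given x and the measurement y.

rng = np.random.default_rng(0)

def denoise_step(x, t, y, phi):
    """Stand-in for a DDRM-style posterior denoising step that moves x
    toward p(x | y, phi), guided by a frozen, pre-trained diffusion
    prior (not implemented here)."""
    return x  # placeholder

def operator_posterior_step(phi, x, y):
    """Stand-in Gibbs move sampling phi from p(phi | x, y)."""
    return phi  # placeholder

def gibbs_ddrm(y, phi_init, num_steps=1000):
    x = rng.normal(size=y.shape)   # start reverse diffusion from pure noise
    phi = phi_init                 # initial guess of the linear operator
    for t in np.linspace(1.0, 0.0, num_steps):
        x = denoise_step(x, t, y, phi)             # (a) data update
        phi = operator_posterior_step(phi, x, y)   # (b) operator update
    return x, phi                  # restored signal and operator estimate
```

Because the diffusion prior stays frozen, the same pre-trained model can be reused across different blind inverse problems, which is what makes the method problem-agnostic.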

FP-Diffusion

Accepted at ICML 2023
TL;DR: Improving the density estimation of diffusion models, with theoretical support, by regularizing them with the underlying equation that describes the temporal evolution of scores.
Diffusion models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a partial differential equation (PDE) governing the spatio-temporal evolution of a density undergoing a diffusion process. In this work, we derive a corresponding equation, called the score FPE, that characterizes the noise-conditional scores of the perturbed data densities (i.e., their gradients). Surprisingly, despite impressive empirical performance, we observe that scores learned via denoising score matching (DSM) do not satisfy the underlying score FPE. We prove that satisfying the FPE is desirable, as it improves the likelihood and the degree of conservativity. Hence, we propose regularizing the DSM objective to enforce satisfaction of the score FPE, and we show the effectiveness of this approach across various datasets.
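For reference, for a forward SDE dx = f(x,t) dt + g(t) dw, the score s(x,t) = ∇x log p_t(x) satisfies a PDE of the following form (our transcription of the score FPE, obtained by taking the gradient of the log-density evolution implied by the Fokker-Planck equation; see the paper for the precise statement):

```latex
% Score FPE for dx = f(x,t)\,dt + g(t)\,dw, with s(x,t) = \nabla_x \log p_t(x):
\partial_t s(x,t) \;=\; \nabla_x\!\left[
    \tfrac{1}{2}\, g(t)^2 \left( \nabla_x \!\cdot\! s(x,t) + \lVert s(x,t)\rVert^2 \right)
    \;-\; \langle f(x,t),\, s(x,t) \rangle
    \;-\; \nabla_x \!\cdot\! f(x,t)
\right]
```

The proposed regularizer then adds a penalty on the residual of this PDE to the standard DSM training objective.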


SQ-VAE

Presented at ICML 2022
TL;DR: Training vector quantization efficiently and stably within a variational Bayes framework.
A noted issue with the vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the codebook's full capacity, a phenomenon known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called the stochastically quantized variational autoencoder (SQ-VAE). In SQ-VAE, we observe that quantization is stochastic at the initial stage of training but gradually converges toward deterministic quantization, a trend we call self-annealing. Our experiments show that SQ-VAE improves codebook utilization without the common heuristics. Furthermore, we empirically show that SQ-VAE outperforms VAE and VQ-VAE on vision- and speech-related tasks.
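The stochastic quantization step at the heart of this idea can be sketched as follows; the codebook size, dimensions, and Gaussian noise model here are illustrative assumptions rather than the exact formulation in the paper.

```python
import torch

# Sketch of stochastic quantization: instead of snapping each encoder
# output to its nearest codeword (as in VQ-VAE), sample a codeword with
# probability proportional to exp(-||z - e_k||^2 / (2 sigma^2)).
# As the learned sigma shrinks during training, sampling concentrates on
# the nearest codeword ("self-annealing" toward deterministic VQ).

K, D = 512, 64                       # codebook size / latent dim (illustrative)
codebook = torch.randn(K, D)         # codewords e_1 .. e_K
log_sigma2 = torch.zeros(())         # learned variance of dequantization noise

def stochastic_quantize(z):
    """z: (batch, D) encoder outputs -> sampled codewords and indices."""
    d2 = torch.cdist(z, codebook).pow(2)       # (batch, K) squared distances
    logits = -0.5 * d2 / log_sigma2.exp()      # Gaussian log-likelihoods
    idx = torch.distributions.Categorical(logits=logits).sample()
    return codebook[idx], idx

z = torch.randn(8, D)
z_q, idx = stochastic_quantize(z)
print(z_q.shape, idx.shape)          # torch.Size([8, 64]) torch.Size([8])
```

Because every codeword has nonzero sampling probability early in training, gradients reach the whole codebook, which is one intuition for the improved codebook utilization.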


Technology 05 Neural RGB-D Fusion Models for Sparse Time-of-Flight Sensing

At Sony, we believe that active depth sensing is key to more robust, reliable, and real-time 3D perception. However, depth sensing on mobile devices comes with many challenges, above all tight power constraints. We present a software pipeline that produces dense depth maps from a sparse time-of-flight (ToF) sensor and a single RGB camera. Our pipeline relies on an original neural depth completion model running at real-time frame rates on a Qualcomm Smartphone Reference Design board equipped with a sparse ToF sensor. The solution is designed to minimize power consumption while providing depth maps viable for AR/VR use cases.
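As a rough, generic illustration of RGB-guided depth completion (not our actual network), a model of this kind typically concatenates the RGB image, the sparse depth map, and a validity mask, then regresses dense depth:

```python
import torch
import torch.nn as nn

# Toy RGB-guided depth completion network: the input stacks RGB (3 ch),
# sparse ToF depth (1 ch, zeros where unmeasured), and a validity mask
# (1 ch); a small encoder-decoder regresses dense depth. Layer sizes are
# illustrative; a production mobile model would be far more optimized.

class ToyDepthCompletion(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),  # downsample
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear"),          # upsample
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, rgb, sparse_depth, mask):
        x = torch.cat([rgb, sparse_depth, mask], dim=1)  # (B, 5, H, W)
        return self.net(x)                               # dense depth (B, 1, H, W)

rgb = torch.rand(1, 3, 240, 320)
depth = torch.zeros(1, 1, 240, 320)   # sparse ToF: few valid measurements
mask = (depth > 0).float()
print(ToyDepthCompletion()(rgb, depth, mask).shape)  # torch.Size([1, 1, 240, 320])
```

Feeding the validity mask explicitly lets the network distinguish "depth is zero" from "depth is missing", a common design choice when the ToF samples are very sparse.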

Technology 06 Sony's 3D Environment Sensing: Research and Applications

3D Environment Sensing is a technology that analyzes the photos a device takes to make it aware of its environment, estimate its position, and reconstruct a realistic 3D model of its surroundings. Applications span a wide range of fields, including the entertainment, robotics, and video production industries.


Publications