Monday, August 1
- 09:30 - 10:30
- Transforming Data into Actionable and Accessible Words
Generating natural language captions from images has been researched in depth. However, generating natural language captions from charts has been less explored due to a lack of datasets. In this talk, I will share our research approaches, patented and published at WACV, WWW, CHI, and EMNLP. Further, I will share how our work evolved from an intern project to academic papers, to customer-facing demos, and to product.
- 10:30 - 11:30
- Bridging AI and Human through Communication: Multimodality, Interpretability, and Fairness
Numerous AI systems have been developed from data generated from human communications, and many of them have aided and facilitated human communications. In this talk, I will discuss current progress and future opportunities in multimodal human-AI communication, focusing on two themes: (1) learning from natural communication and (2) using interpretable and fair AI as a medium to communicate with the world. I will first introduce recent works leveraging human gestures as a natural interface to guide and teach autonomous agents such as robots and virtual navigation agents and discuss our novel frameworks that can support unsupervised, communicative learning of human gestures. It will be shown that the semantics of gestures can be learned by agents via communication and incorporated into their policies. In the second part of my talk, I will explain how AI models can serve as a medium to understand human communication and social behaviors and highlight the importance of fairness and interpretability in models. I will introduce several recent works for bias measurement and mitigation including constructing a balanced dataset, mitigating annotators' cognitive biases, and counterfactual bias measurement. I will also discuss our latest work on explaining CNNs using unsupervised visual-semantic attention projection learning and demonstrate its applications for unsupervised visual data analytics.
- 11:30 - 12:30
- Quantifying and extrapolating the capabilities of language models
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 200+ tasks, contributed by 400+ authors across 130+ institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline.
Lunch Break · 12:30 - 13:30
Multimodal Learning for Videos
Paul Hongsuck Seo
Humans perceive the world through multiple sensory systems (e.g., vision, audition, touch, smell) that work together and complement each other. It is therefore natural to build models that process multiple modalities simultaneously, especially given rich multimodal online data such as modern video content containing visual frames, audio signals, human speech, and metadata. Moreover, the emergence of online video sharing platforms has made it possible to collect large-scale video datasets for such tasks. In this talk, I will introduce our recent studies on multimodal learning in videos and explore the techniques that we can apply to improve the multimodal understanding capabilities of models. Through extensive experiments, we show significant improvements from incorporating multimodal input signals and demonstrate the effectiveness of the proposed techniques.
Scaling Robot Learning with Skills: Furniture Assembly and Beyond
Despite the recent progress in robot learning, robotics research and benchmarks today are typically confined to simple short-horizon tasks. However, tasks in our daily lives are much more complicated, consisting of multiple sub-tasks and requiring high-dexterity skills, and the typical “learning from scratch” scheme hardly scales to such complex long-horizon tasks.
In this talk, I propose to extend the range of tasks that robots can learn by acquiring a useful skillset and efficiently harnessing these skills. As a first step, I will introduce a novel benchmark for complex long-horizon manipulation tasks, the IKEA furniture assembly simulator. Then, I will present skill chaining approaches that enable sequential skill composition for long-horizon tasks. Finally, I will talk about how to learn a complex task efficiently using skills and skill priors extracted from diverse data.
- 15:30 - 16:15
- Embodied AI: From Machine Learning to Learning Machines
Machine learning (including deep learning) has changed the paradigm of AI from rule-based “manual” programming to data-driven “automatic” programming. However, the current paradigm of machine learning requires an external system that provides the learner with data, which limits its scalability. Here we argue that the learner can feed itself data autonomously if it is embodied, i.e., equipped with sensors and actuators. Through the perception-action cycle, an embodied AI can continually learn to solve problems in a self-teaching way by taking new actions, observing their outcomes, and correcting its own predictions, as humans and animals do. In this talk, I will show some of our studies in this direction of “(embodied) learning machine” research and discuss its implications for achieving truly human-level general AI.
Break · 16:15 - 16:35
- 16:35 - 16:55
- Accurate Node Feature Estimation with Structured Variational Graph Autoencoder
Given a graph with partial observations of node features, how can we estimate the missing features accurately? Feature estimation is a crucial problem for analyzing real-world graphs whose features are commonly missing during the data collection process. This talk introduces our recent work to be presented at KDD 2022, which proposes SVGA (Structured Variational Graph Autoencoder) for accurate feature estimation. SVGA applies strong regularization to the distribution of latent variables through structured variational inference, which models the prior of the variables as a Gaussian Markov random field based on the graph structure. As a result, SVGA combines the advantages of probabilistic inference and graph neural networks, achieving state-of-the-art performance on real datasets.
- 16:55 - 17:15
- DPar2: Fast and Scalable PARAFAC2 Decomposition for Irregular Dense Tensors
Given an irregular dense tensor, how can we efficiently analyze it? An irregular tensor is a collection of matrices that share the same number of columns but differ in their number of rows. PARAFAC2 decomposition is a fundamental tool for dealing with irregular tensors in applications including phenotype discovery and trend analysis. Although several PARAFAC2 decomposition methods exist, their efficiency is limited for irregular dense tensors due to the expensive computations involved with the tensor.
In this paper, we propose DPar2, a fast and scalable PARAFAC2 decomposition method for irregular dense tensors. DPar2 achieves high efficiency by effectively compressing each slice matrix of a given irregular tensor, carefully reordering computations with the compression results, and exploiting the irregularity of the tensor. Extensive experiments show that DPar2 is up to 6.0x faster than competitors on real-world irregular tensors while achieving comparable accuracy. In addition, DPar2 is scalable with respect to the tensor size and target rank.
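For readers unfamiliar with the terminology, the irregular tensor and the standard PARAFAC2 model referred to above can be written as follows. The symbols (K slices, I_k rows, J columns, target rank R, and the factor names) are notation assumed here for illustration and are not taken from the abstract; the block shows only the generic PARAFAC2 model, not DPar2's compression or reordering steps.

```latex
\begin{align*}
&\{\mathbf{X}_k\}_{k=1}^{K}, \qquad \mathbf{X}_k \in \mathbb{R}^{I_k \times J}
  && \text{(slices share $J$ columns; row counts $I_k$ differ)} \\
&\mathbf{X}_k \approx \mathbf{Q}_k \mathbf{H} \mathbf{S}_k \mathbf{V}^{\top},
  \qquad \mathbf{Q}_k^{\top} \mathbf{Q}_k = \mathbf{I}_R
  && \text{(PARAFAC2 model with target rank $R$)} \\
&\mathbf{Q}_k \in \mathbb{R}^{I_k \times R}, \quad
  \mathbf{H} \in \mathbb{R}^{R \times R}, \quad
  \mathbf{S}_k \in \mathbb{R}^{R \times R} \text{ diagonal}, \quad
  \mathbf{V} \in \mathbb{R}^{J \times R}
\end{align*}
```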
- 17:15 - 17:35
- ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes the ever-growing amount of online videos an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited or overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing approaches rely almost exclusively on datasets with predetermined taxonomies of semantic concepts, where there is a high chance of audio-visual correspondence. Unfortunately, constructing such datasets requires labor-intensive manual annotation and/or verification, which severely limits the utility of online videos for large-scale learning. In this work, we present an automatic dataset curation approach based on subset optimization, where the objective is to maximize the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performance compared to models trained on existing manually curated datasets. The most significant benefit of our approach is scalability: we release ACAV100M, which contains 100 million videos with high audio-visual correspondence, ideal for self-supervised video representation learning.
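The idea of choosing a subset that maximizes mutual information between audio and visual channels can be illustrated with a toy sketch. The abstract does not say how mutual information is estimated or optimized, so everything below is an assumption made for illustration: each video is assumed to carry a discrete audio cluster id and a discrete visual cluster id, and the hypothetical helper `greedy_select` uses a naive greedy loop rather than the authors' algorithm.

```python
import math
from collections import Counter


def mutual_information(audio_labels, visual_labels):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(audio_labels)
    if n == 0:
        return 0.0
    joint = Counter(zip(audio_labels, visual_labels))
    p_audio = Counter(audio_labels)
    p_visual = Counter(visual_labels)
    mi = 0.0
    for (a, v), count in joint.items():
        # p(a, v) * log( p(a, v) / (p(a) * p(v)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (p_audio[a] * p_visual[v]))
    return mi


def greedy_select(audio_labels, visual_labels, k):
    """Hypothetical helper: grow a subset of k videos, each time adding the
    video that gives the highest mutual information for the subset so far."""
    selected = []
    remaining = set(range(len(audio_labels)))

    def score(candidate):
        idx = selected + [candidate]
        return mutual_information(
            [audio_labels[i] for i in idx],
            [visual_labels[i] for i in idx],
        )

    while len(selected) < k and remaining:
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The greedy loop recomputes the score for every candidate at every step, which is only practical for toy inputs; a method operating at the scale of 100 million videos would need a far more efficient optimization strategy.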