An HCI Researcher’s Summary of the AAAI 2018 Conference!

Victor Dibia
Towards Data Science
23 min read · Feb 13, 2018


Summary of talks from AAAI18 that cover topics in Computer Vision, Machine Learning/Deep Learning (Catastrophic Forgetting), Learning Representations, Knowledge Graphs and Applied AI in general.

This post contains notes on sessions I attended at the just-concluded Artificial Intelligence Conference (AAAI 2018, New Orleans, Louisiana). The selection of talks was based on my interests from an HCI and applied AI perspective. These include human aspects of AI, vision, machine learning/deep learning (catastrophic forgetting), learning representations, knowledge graphs and applied AI in general. I welcome feedback, corrections (typos!) and discussions - please get in touch (@vykthur). For those interested in an additional overview of AAAI, David Abel has also written a detailed summary of AAAI 2018 and covers some sessions I did not attend.
Note: technical talks were 15–20 minutes long and many of the papers/slides are not publicly available; TLDR: some of my notes may lack detail.

Saturday Feb 3

Tutorial — Knowledge Graph Construction from Web Corpora — Mayank Kejriwal, Craig Knoblock, Pedro Szekely [Slides]

This tutorial, presented by researchers from the Information Sciences Institute at USC, provided an overview of the process involved in creating a knowledge graph using data scraped from websites.

Motivation

  • Why Domain Specific Knowledge Graphs (DSKG)?
    Human behavior suggests we already perform domain-specific search (DSS). E.g. people go to a section on Amazon to search for items or to YouTube to search for videos instead of performing a general Google search.
  • DSKGs do a better job of answering domain specific questions.
  • DSS > Keyword search; it codifies domain knowledge across many documents.
  • There is growing interest in DSKGs, including projects sponsored by DARPA.

Knowledge Graph Construction

  • A knowledge graph is a set of triples (Head, Relationship, Tail). E.g. Barack Obama (H) was born in (R) Hawaii (T). (A toy sketch in Python follows this list.)
  • Generating a knowledge graph from data crawled from a website can be complicated. Manual approaches to this problem are not scalable hence the need for systems that automate this process. Extracting data and entities from tables, graphs, images of plots and even excel files can present specific challenges.
  • Most approaches to automation follow a grammar induction approach, where a system attempts to learn the underlying structure in crawled data. Examples of these algorithms include RoadRunner, Disjoint Regex, etc.
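
To make the triple representation concrete, here is a minimal sketch of a knowledge graph as Python tuples with a naive query helper (my own illustration, not code from the tutorial; real systems like Karma and DIG also handle extraction, entity resolution and noisy data):

```python
from typing import List, Optional, Tuple

# A knowledge graph as a list of (head, relation, tail) triples.
Triple = Tuple[str, str, str]

kg: List[Triple] = [
    ("Barack Obama", "was born in", "Hawaii"),
    ("Hawaii", "is a state of", "United States"),
]

def query(kg: List[Triple], head: Optional[str] = None,
          relation: Optional[str] = None) -> List[Triple]:
    """Return all triples matching the given head and/or relation."""
    return [t for t in kg
            if (head is None or t[0] == head)
            and (relation is None or t[1] == relation)]

print(query(kg, head="Barack Obama"))
# -> [('Barack Obama', 'was born in', 'Hawaii')]
```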

Example Tools for Knowledge Graph Construction.
The final part of the tutorial covered demonstrations of tools the authors have developed at their lab for knowledge graph generation: Karma and DIG. They also showed an interesting use case where they build a knowledge graph to help with tackling human trafficking.

My thoughts: The tutorial was useful in understanding the process pipeline for crawling websites. It had more manual aspects than I had expected (extracted entities need to be manually labelled, and results can be quite noisy). I was also looking to learn more about KGs for question answering, which was not covered much in this talk.

Sunday Feb 4

[Presidential Address] Challenges of Human Aware AI Systems [Video, Slides]

This talk was presented by the AAAI president (Subbarao Kambhampati) and provided some perspective on human-aware AI (HAAI) — motivations, research challenges and open issues.

Motivation

  • AI today has a curious ambivalence to humans.
  • But humans are needed across many fields where AI can be applied: intelligent tutoring systems, social robotics, assistants (personal, medical), and human-robot teams.
  • We should pursue human-aware AI because it broadens the scope and promise of AI.

Research Challenges and Issues

  • Human-Agent teaming requires modeling the Human: Getting a robot/AI agent to maintain an accurate model of the human it is interacting with is a challenge. The accuracy of this model is necessary for creating plans that are optimized for both robot and human. Studies on human-human teams can inform solutions to this problem.
  • Robots/AI agents can recover from these inaccuracies by either sacrificing their optimal model to satisfy the human, or offering/requesting information (explicability) that helps resolve differences [see slides for more references]. Explicability from an agent is necessary to engender trust.
  • Ethical Challenges — If/when AI agents can model humans correctly, it might give them the ability to “tell lies”.

My thoughts: I found this talk interesting as it helped provide an overview of work being done in the AI planning domain to improve human-robot interaction: efforts to model the human, make plans explicable, and perspectives on trade-offs.

[Technical Talk] Extreme Low Resolution Activity Recognition with Multi-Siamese Embedding Learning. [Paper]

Example videos of the HMDB and DogCentric datasets. The upper rows show the original HR videos and the lower rows show the 16x12 extreme low resolution videos used in the paper experiments. The first two videos are from HMDB and the other two videos are from DogCentric. Source

This talk (presented by Michael S. Ryoo from Indiana University) introduces a neural network approach to recognizing activity in low-resolution videos, as low as 16x12 pixels!

Motivation

  • Low resolution processing helps us understand activity at a distance (far field recognition). E.g. what is happening in the background within a video segment (people walking, waving etc).
  • Low-resolution recognition affords privacy (faces are so small that they become unidentifiable) and battery conservation benefits.

Contribution

  • This work contributes a new two-stream multi-Siamese convolutional neural network. In low-resolution (LR) videos, two clips originating from the exact same scene often have totally different pixel values depending on their LR transformations.
  • The approach learns a shared embedding space that maps LR videos with the same content to the same location regardless of their transformations (a simplified sketch follows this list).
  • This mapping is leveraged in improving inference when activity recognition is done on the LR videos.
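
To make the embedding idea concrete, here is a rough PyTorch sketch of my own (not the authors' code, and a single-frame simplification of their two-stream video model): a shared encoder is trained with a contrastive loss so that different LR transformations of the same scene embed close together, while different scenes are pushed apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LREncoder(nn.Module):
    """Shared encoder that maps tiny 16x12 frames into an embedding space."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(16 * 4 * 4, embed_dim)

    def forward(self, x):                      # x: (batch, 3, 12, 16)
        h = self.conv(x).flatten(1)
        return F.normalize(self.fc(h), dim=1)  # unit-length embeddings

def contrastive_loss(z1, z2, same_scene, margin=0.5):
    """Pull embeddings together when both LR inputs show the same scene;
    push them at least `margin` apart otherwise."""
    d = (z1 - z2).pow(2).sum(dim=1)
    return torch.where(same_scene, d, F.relu(margin - d.sqrt()).pow(2)).mean()

enc = LREncoder()
a, b = torch.randn(8, 3, 12, 16), torch.randn(8, 3, 12, 16)
same = torch.randint(0, 2, (8,)).bool()        # 1 if a[i], b[i] share a scene
loss = contrastive_loss(enc(a), enc(b), same)
loss.backward()
```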

My thoughts: This paper provided some interesting insights on working with low-resolution images (regular CNNs require a 200x200 image!). They learn mappings from multiple LR transformations and leverage that knowledge to improve inference accuracy. I find this interesting as approaches like this can inform efforts to perform AI at the edge (allowing us to work with smaller images, improving inference speed and battery life). The reported accuracy is not very high (37.70%), and they achieve 50 fps on a Jetson TX2 running TensorFlow.

[Technical Talk] Less-forgetful Learning for Domain Expansion in Deep Neural Networks [Paper]

Source

This paper provides an approach to tackling an extension of catastrophic forgetting (which they call domain expansion) in neural networks.

Motivation.
They define the domain expansion problem as the problem of creating a network that works well on both an old domain and a new domain, even after it is trained in a supervised way using only data from the new domain, without accessing data from the old domain. They identify two challenges for domain expansion.

  • First, the performance of the network on the old domain should not be degraded even if the new domain data are learned without seeing those of the old domain.
  • Second, a DNN should work well without any prior knowledge of which domain the input data came from.

Approaches to Catastrophic Forgetting

  • Fine-Tune Softmax: freeze the lower layers and fine-tune the final softmax classifier layer. This method regards the lower layers as a feature extractor and updates the linear classifier to adapt to new domain data. In other words, the feature extractor is shared between the old and new domains, and the method seems to preserve the old domain information.
  • Weight constraint approach: use L2 regularization to keep the weight parameters of the new network similar to those of the old network when learning the new data (a minimal sketch of this idea follows this list).
  • Model proposed by the paper: uses the trained weights of the old network as the initial weights of the new network and simultaneously minimizes two loss functions.
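
As a concrete illustration of the weight-constraint idea (a minimal PyTorch sketch of my own, not the paper's code): the new network is initialized from the old one, and an L2 penalty keeps its weights from drifting away while it fits new-domain data.

```python
import copy
import torch
import torch.nn as nn

def l2_weight_penalty(new_net: nn.Module, old_net: nn.Module) -> torch.Tensor:
    """Sum of squared differences between corresponding parameters."""
    penalty = torch.zeros(())
    for p_new, p_old in zip(new_net.parameters(), old_net.parameters()):
        penalty = penalty + (p_new - p_old.detach()).pow(2).sum()
    return penalty

old_net = nn.Linear(10, 2)        # stand-in for a network trained on the old domain
new_net = copy.deepcopy(old_net)  # initialize the new network from the old weights

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # new-domain batch
task_loss = nn.functional.cross_entropy(new_net(x), y)
loss = task_loss + 1e-2 * l2_weight_penalty(new_net, old_net)
loss.backward()
```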

My thoughts: I was looking to learn more about catastrophic forgetting, and this talk (and the related references) was very useful for understanding the state of research on this topic. I note that the evaluation here (and in most other papers) generally uses small datasets (e.g. CIFAR-10) to evaluate how well solutions perform after training on a new domain. It is not clear that this will scale well to larger datasets. Catastrophic forgetting remains an important, open and hard problem.

[Technical Talk] A Cascaded Inception of Inception Network With Attention Modulated Feature Fusion for Human Pose Estimation [Paper]

Results of the paper on the MPII dataset. It is shown that the method is able to handle non-standard poses and resolve ambiguities when body parts are occluded. Source.

Motivation

  • Pose estimation, i.e. accurate keypoint localization for body parts, is needed to analyze behaviour in images and video. It requires diversified features: high-level features for contextual dependencies and low-level features for detailed refinement of joints.
  • Localizing human joints is hard. There are challenges due to deformation of the human body, changes in appearance, and occlusion.
  • Existing methods have limitations in preserving low-level features, adaptively adjusting the importance of different levels of features, and modeling the human perception process.

Contribution.

This paper presents three novel techniques step by step to efficiently utilize different levels of features to improve human pose estimation.

  • Firstly, an inception of inception (IOI) block is designed to emphasize the low level features.
  • Secondly, an attention mechanism is proposed to adjust the importances of individual levels according to the context.
  • Thirdly, a cascaded network is proposed to sequentially localize the joints to enforce message passing from joints of stand-alone parts like head and torso to remote joints like wrist or ankle.

My thoughts: This paper provides insights on ways to tune networks to address specific issues (in this case being robust to errors in pose estimation caused by occlusion and deformation).

[Technical Talk] Graph Correspondence Transfer for Person Re-identification [Paper]

(a) shows misalignment among local patches caused by viewpoint changes. The proposed GCT model can capture the correct semantic matching among patches using patchwise graph matching, as shown in image (b). Source

Motivation

  • Person re-identification (ReID) is the task of associating images with the same identity across different non-overlapping camera views. It has a ton of implications for video analytics, surveillance, etc.
  • ReID is hard because of large appearance changes across camera views, heavy body occlusions, and the similarity of many other images in the probe set. Spatial misalignment, where patches have different appearances in different views/frames, adds further difficulty.

Contribution

This paper proposes a graph correspondence transfer (GCT) approach for person re-identification.

  • Unlike existing methods, the GCT model formulates person re-identification as an off-line graph matching and on-line correspondence transferring problem.
  • During training, the GCT model aims to learn off-line, a set of correspondence templates from positive training pairs with various pose-pair configurations via patch-wise graph matching.
  • During testing, for each pair of test samples, they select a few training pairs with the most similar pose-pair configurations as references, and transfer the correspondences of these references to the test pair for feature distance calculation. The matching score is derived by aggregating distances from different references. For each probe image, the gallery image with the highest matching score is the re-identification result. Compared to existing algorithms, GCT can handle spatial misalignment caused by large variations in view angles and human poses, owing to the benefits of patch-wise graph matching.

My thoughts: An interesting way to leverage offline knowledge in a task. They learn correct patch correspondences for images with certain pose pairs and reuse this knowledge at test time on images with similar pose-pair configurations!

[Technical Talk] End-to-End United Video Dehazing (EVD) and Detection [Paper]

Comparisons of detection results on real-world hazy video sample frames. Note that for the third, fourth and fifth columns, the results are visualized on top of the (intermediate) dehazing results. Source

Some work by researchers from Microsoft Research China on creating an end-to-end model that performs both dehazing and a downstream operation like object detection.

Motivation

  • The removal of haze from visual data captured in the wild has been attracting tremendous research interests, due to its profound application values in outdoor video surveillance, traffic monitoring and autonomous driving.
  • CNNs have been applied to dehazing images with SOTA results (DehazeNet, AOD-Net).
  • There has been limited effort in exploring video dehazing, whether via traditional statistical approaches or CNNs.

Contribution

  • An end-to-end model (Li et al. 2017a; Wang et al. 2016) that directly regresses clean images from hazy inputs without any intermediate step. This model outperforms multi-stage pipelines.
  • Embrace the video setting by explicitly considering how to embed the temporal coherence between neighboring video frames when restoring the current frame.
  • The authors show that a pipeline for joint optimization of video dehazing and video object detection performs better than running detection on a video stream that has been independently dehazed. Beyond the dehazing part, the detection part also has to take temporal coherence into account, to reduce flickering detection results.

My thoughts: The researchers identify a joint optimization pipeline that helps improve object detection accuracy for hazy images. This work was helpful in clarifying the utility of joint optimization and end-to-end models.

Monday Feb 5

[Invited Talk] Interactive Machine Learning [Charles Isbell and Michael Littman]

This talk was presented by Charles Isbell from Georgia Tech and Michael Littman from Brown. It was an interesting, engaging (sometimes witty) and interactive presentation by both presenters. It started with a discussion of reinforcement learning (RL) by Charles and the role a user plays in RL, followed by a lively back-and-forth, debate-style interaction where Michael walks us through his research on how human feedback should be used in RL. Some notes below:

Reinforcement Learning (Charles Isbell)

  • Reinforcement learning: assume there is an agent which can interact with an environment and get some response which can be used to modify future interactions. Modeled as sequential Markov models. Allows us to learn from small datasets.
  • Interactive Reinforcement Learning: leverage and extract good data from humans.
  • Why should we trust the user? People are the final arbiter; they know the goal best.
  • Humans are sometimes dumb: sometimes it is important to ignore human input (or avoid literal interpretations of it) to achieve optimal outcomes. Consider a simulation where an agent could save humans in a field containing areas of radiation. A simple “Go to human, avoid radiation” rule can lead to fewer humans being saved. Also, humans stop giving feedback after some point. E.g. adults don’t get compliments for learning to ride a bike … but kids do.

Interactive Machine Learning (Michael Littman)

  • Sometimes, agents learn the wrong thing. E.g. a dog in a simulation that would walk into a garden, so that it could get the positive reward for getting out of the garden. This suggests we should revisit the way we use rewards.
  • Feedback should be a gradient signal modeled as a critique of the agent’s policy/behaviour. Instead of plugging feedback into the agent’s reward system, we use it as an input to an equation that models the agent’s behaviour/policy (the advantage function). This approach better supports diminishing returns, differential feedback and policy shaping (a toy sketch follows this list).
  • Humans are not dumb: we just have to do better at interpreting human input and how it is used to update an agent’s parameters. We cannot outsource our understanding of humans. Humans are the loop, and ML researchers should engage more with human behaviour researchers.
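
A toy sketch of the advantage-style feedback idea (my own simplification, in the spirit of the actor-critic methods discussed): human feedback directly scales the policy-gradient update for the critiqued action, playing the role of the advantage, rather than being added to the environment reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)            # softmax policy preferences over 3 actions

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

lr = 0.5
for _ in range(50):
    probs = policy(theta)
    a = rng.choice(3, p=probs)
    feedback = 1.0 if a == 2 else -0.2   # pretend the human likes action 2
    grad_log_pi = -probs                 # grad of log pi(a) is onehot(a) - probs
    grad_log_pi[a] += 1.0
    theta = theta + lr * feedback * grad_log_pi  # feedback acts as the advantage

print(policy(theta))           # probability mass concentrates on action 2
```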

[Invited Talk][Human AI Collaboration] Towards Theory of AI’s Mind, Devi Parikh [Paper]

(a) These montages show an agent’s quirks. For a given question, it has the same response to each image in a montage. (b) Human subjects predict the success or failure and output responses of a VQA agent (called Vicki). Source

An interesting talk by Devi Parikh from Georgia Tech, where she argues for more research that quantifies a user’s understanding of the limits of AI.

  • We are increasingly working with AI agents that can take in images and understand objects, activities, scenarios and dialog, describe objects, etc. See demos on CloudCV (Visual Chatbot, Visual QA).
  • Natural Language is important because humans are the consumers of the reasoning by machines.
  • Human-AI teams are the future. The agent should have a good model of human wants, intentions, beliefs, etc., i.e. a theory of the mind of the user. The human must also have a feel for the capabilities, limits, beliefs, etc. of the machine: a theory of the AI’s mind.
  • How do humans understand AI? How does a human approximate a neural network? To what extent do explainability efforts help humans understand AI’s limits?
  • Agents are quirky.
    Most agents can’t count, can’t reason consistently, and can’t consult an external knowledge base; e.g. they don’t know how close 3 is to 4. They have limited vision and limited language capabilities, don’t see the priors in the world (e.g. bananas are mainly yellow), learn from datasets that have biases, and attempt to answer all questions.
  • Agents are quirky in a predictable way, and we want to quantify this. She conducted an experiment with an agent trained to perform VQA (see Hierarchical Question-Image Co-Attention for Visual Question Answering). Given an image and a question, participants are asked to estimate whether the agent will answer correctly.
    — 60% of people correctly predicted the agent’s performance.
    — Explanation modalities increase prediction accuracy: saliency maps and other modalities result in better predictions.
  • She suggests the approach of training agents using supervised learning and fine-tuning them using reinforcement learning.
  • Next steps and some important papers: collaboration games where humans describe an image and the agent tries to guess the image; both share the reward for correct guesses. Some papers on this work:
    - https://visualdialog.org/
    - [HCOMP 2017] Evaluating Visual Conversational Agents via Cooperative Human-AI Games .
    - [ICCV 2017] Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
    - [CVPR 2017] Visual Dialog.

As an HCI researcher, I found Devi’s presentation to be one of the most interesting talks at the conference. I particularly appreciate the effort made to develop systems that advance the SOTA (e.g. VQA in her work) alongside experiments that explore the human factors associated with interacting with these systems. My feeling is that Devi’s work represents a good example of research that successfully combines AI and HCI.

Tuesday Feb 6

[Invited Talk] Fair Questions, Cynthia Dwork

Cynthia provided thoughts that span her rich work on differential privacy, algorithmic fairness and bias.

  • Biases can influence Algorithmic fairness
    Data, algorithms, and systems have biases embedded within them reflecting designers’ explicit and implicit choices, historical biases, and societal priorities. They form, literally and inexorably, a codification of values. Unfairness of algorithms — for tasks ranging from advertising to recidivism prediction — has recently attracted considerable attention in the popular press. (Dwork, Hardt, Pitassi, Reingold, and Zemel, 2012).
  • Notions of Fairness in Classification
    — Statistical parity: the property that the demographics of those receiving positive (or negative) classifications are identical to the demographics of the population as a whole (a toy computation follows this list).
    — Treat similar individuals similarly: any two individuals who are similar with respect to a particular task should be classified similarly.
    — The effect of composition on individual and group fairness.
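
As a toy illustration of statistical parity (my own example, assuming binary decisions and a binary protected attribute, not from the talk): compare the positive-classification rate across groups; a gap near zero means the demographics of positive outcomes match the population.

```python
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # classifier decisions
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # protected-attribute group

rate_g0 = y_pred[group == 0].mean()
rate_g1 = y_pred[group == 1].mean()
gap = abs(rate_g0 - rate_g1)
print(f"positive rate g0={rate_g0:.2f}, g1={rate_g1:.2f}, parity gap={gap:.2f}")
```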

Cynthia concludes with some notes

  • Education has two uses: helping bring up the disadvantaged, and the main task of imparting knowledge. Compensate less as education progresses.
  • The negative effect of AI becomes more profound if it compounds existing issues, e.g. increasing premiums for at-risk mothers (who need financial support the most).
  • There is a need for metrics that describe “fairness” of a dataset.

[Technical Talk] Selective Experience Replay for Lifelong Learning.

A talk by David Isele and Akansel Cosgun from the UPenn GRASP lab.

Motivation

  • We usually want to train a model within a controlled environment and hope for graceful performance (the ability to adapt to novel situations) in the wild.
  • In many cases the learned model may not be a good fit immediately, but it can represent an excellent initialization point. We can then initiate some form of learning based on this initialization.
  • However, catastrophic forgetting introduces complications when we attempt this “online” learning. Approaches that support online learning without forgetting are needed.

Contribution

  • This work draws inspiration and parallels from biology and suggests selective experience replay. To be useful for a continuous learner, experience replay must be modified to allow adaptation to changing environments while holding on to past experiences that are still relevant. See Mnih et al. 2015 for deep reinforcement learning and experience replay.
  • Experience replay leverages long- and short-term memory. However, we cannot hold all experiences in memory and need strategies for selecting which memories to hold on to.
  • This work provides an evaluation of experience selection strategies (including theirs) and how they impact lifelong learning performance: Surprise, Reward, Maximize Coverage, and Match Distribution (a toy sketch of reward-based selection follows this list).
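
A toy sketch of one such selection strategy, reward-based retention, as I understood it (my own illustration; surprise- or coverage-based strategies would simply swap in a different eviction score):

```python
import random
from collections import namedtuple

Experience = namedtuple("Experience", "state action reward next_state")

class SelectiveReplayBuffer:
    """Fixed-size buffer that, when full, keeps the experiences whose
    rewards are largest in magnitude (one possible selection strategy)."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.buffer = []

    def add(self, exp: Experience):
        if len(self.buffer) < self.capacity:
            self.buffer.append(exp)
            return
        # Evict the least valuable stored experience if the new one beats it.
        worst = min(range(len(self.buffer)),
                    key=lambda i: abs(self.buffer[i].reward))
        if abs(exp.reward) > abs(self.buffer[worst].reward):
            self.buffer[worst] = exp

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```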

My thoughts: This paper discussed an additional approach to addressing catastrophic forgetting within the context of lifelong learning. This is a more interesting use case (compared to “offline” models which may not need to “learn” new information frequently under constraints). Looking forward to reading the paper when it becomes available.

[Technical Talk] Hierarchical Attention Transfer Network for Cross-domain Sentiment Classification

Interesting applied AI paper that explores the problem of cross-domain sentiment classification.

Motivation

  • When there are differences in distribution between the training and test sets, we generally observe a reduction in accuracy. For example, if we train a model to detect sentiment in text from books/novels, we observe lower classification performance if we apply this model to a different domain, e.g. restaurant review text. Consider the examples [novel text] “A sobering tale” and [restaurant review text] “I felt sober after the meal”: the word sober might take on different sentiments (and meanings).
  • Clearly, we need scalable and end to end trainable approaches that improve cross domain classification and address the issues exemplified above.

Contribution

  • This work exploits linguistic characteristics of text and identifies two types of words: pivots and non-pivots. Pivots are words that convey sentiment but are invariant to domain; e.g. the word disgust is likely to be negative in most domains. Non-pivots are domain-specific words that co-occur with pivots (a toy illustration follows this list).
  • They learn alignments between pivots and non-pivots and how they relate to classification, and then transfer this alignment across multiple domains. To learn the alignment, they propose a Hierarchical Attention Transfer Network (HATN) with attention networks that learn pivots (P-Net) and non-pivots (NP-Net).
  • The authors test their approach across 5 different domains and show improvements in accuracy over the baseline.
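
To illustrate the pivot intuition, here is a toy heuristic of my own (the paper instead learns pivots with attention networks): score each word by how consistently it predicts sentiment within each domain, and keep words with strong, stable polarity across domains as pivot candidates.

```python
from collections import Counter

def sentiment_assoc(docs):
    """docs: list of (tokens, label) with label in {0, 1}.
    Returns P(positive | word) for words seen at least twice."""
    pos, total = Counter(), Counter()
    for tokens, label in docs:
        for w in set(tokens):
            total[w] += 1
            pos[w] += label
    return {w: pos[w] / total[w] for w in total if total[w] >= 2}

books = [(["a", "sobering", "tale", "great"], 1), (["dull", "sobering", "plot"], 0),
         (["great", "characters"], 1), (["dull", "ending"], 0)]
food = [(["great", "meal"], 1), (["dull", "service", "sober"], 0),
        (["great", "flavors"], 1), (["dull", "menu"], 0)]

a, b = sentiment_assoc(books), sentiment_assoc(food)
pivots = {w for w in a.keys() & b.keys()
          if abs(a[w] - 0.5) > 0.3 and abs(a[w] - b[w]) < 0.2}
print(pivots)  # {'great', 'dull'}: consistent polarity in both domains
```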

My thoughts: An interesting and practical demonstration of applied AI. This work is also an example of the ways in which deep domain expertise (in this case linguistics) can inform the design of neural network architectures. As more progress is made on networks that independently learn the right architecture (e.g. Google’s work on AutoML), it will be interesting to see how they compare with solutions based on deep domain expertise.

[Technical Talk] Deep learning from Crowds [Paper]

Bottleneck structure for a CNN for classification with 4 classes and R annotators. Source

Motivation

  • Deep learning has enabled SOTA results in various domains and benefits from labelled datasets. But for many problems this data is not available.
  • To address this, we use the crowd, and sometimes we need crowds of experts (e.g. for medical diagnostics). While this works, it often requires aggregating labels from multiple noisy contributors with different levels of expertise.
  • Several possible aggregation strategies:
    - Majority voting (naively assumes all annotators are equally reliable).
    - Jointly model the unknown biases of the annotators and their answers as noisy versions of some latent ground truth, using expectation-maximization (EM) style algorithms.

Contributions

  • The authors propose a novel crowd layer which enables them to train neural networks end-to-end, directly from the noisy labels of multiple annotators, using only backpropagation (a simplified sketch follows this list).
  • This alternative approach not only avoids the additional computational overhead of EM, but also leads to a general-purpose framework that generalizes trivially beyond classification problems.
  • Empirically, the proposed crowd layer is shown to be able to automatically distinguish the good from the unreliable annotators and capture their individual biases, achieving new state-of-the-art results on real data from Amazon Mechanical Turk for image classification, text regression and named entity recognition.
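
A simplified PyTorch sketch of the crowd-layer idea as I understood it (my own code, not the authors’): the network produces one shared class distribution, and a per-annotator matrix, initialized to the identity, maps it to that annotator’s noisy label distribution. In training, the loss would be computed only against the labels each annotator actually provided.

```python
import torch
import torch.nn as nn

class CrowdLayer(nn.Module):
    """Per-annotator transformation of the shared class distribution."""
    def __init__(self, num_classes: int, num_annotators: int):
        super().__init__()
        # One matrix per annotator; identity init means annotator == truth.
        self.weights = nn.Parameter(
            torch.eye(num_classes).repeat(num_annotators, 1, 1))

    def forward(self, class_probs):            # (batch, classes)
        # Per-annotator predictions: (batch, annotators, classes).
        return torch.einsum("bc,akc->bak", class_probs, self.weights)

backbone = nn.Sequential(nn.Linear(32, 4), nn.Softmax(dim=1))
crowd = CrowdLayer(num_classes=4, num_annotators=5)

x = torch.randn(8, 32)
per_annotator = crowd(backbone(x))   # compare against each annotator's labels
print(per_annotator.shape)           # torch.Size([8, 5, 4])
```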

My thoughts: Unusually simple (concept and implementation) but effective and insightful solution to a problem. These results may also have implications for the general set of use cases that can be formulated as aggregation or voting problems.

[Technical Talk] Measuring Catastrophic Forgetting in Neural Networks [Paper]

As a network is incrementally trained (solid lines), ideally its performance would match that of a model trained offline with all of the data upfront (dashed line). Experiments show that even methods designed to prevent catastrophic forgetting perform significantly worse than an offline model. Incremental learning is key to many real-world applications because it allows the model to adapt after being deployed. Source

Motivation

  • Once a network is trained to do a specific task, e.g., bird classification, it cannot easily be trained to do new tasks, e.g., incrementally learning to recognize additional bird species or learning an entirely different task such as flower recognition. When new tasks are added, typical deep neural networks are prone to catastrophically forgetting previous tasks.
  • However, incremental learning is important. It is desirable to have a network that can adapt to various perturbations in the test set:
    — known data that changes slightly (e.g. weather, vibrations, etc. in driving)
    — new data types (images and then audio)
    — new object classes (e.g. new digits in MNIST)
  • Networks that are capable of assimilating new information incrementally, much like how humans form new memories over time, will be more efficient than re-training the model from scratch each time a new task needs to be learned.
  • There have been multiple attempts to develop schemes that mitigate catastrophic forgetting, but these methods have not been directly compared, the tests used to evaluate them vary considerably, and these methods have only been evaluated on small-scale problems.

Contribution

  • The authors introduce new metrics and benchmarks for directly comparing five different mechanisms designed to mitigate catastrophic forgetting in neural networks: regularization, ensembling, rehearsal, dual-memory, and sparse-coding.
  • Metrics (a toy sketch follows this list):
    — Stability: how well the model remembers the base knowledge.
    — Plasticity: is it still learning new material?
    — Overall performance: the ratio of stability to plasticity.
  • Experiments on real-world images and sounds show that the mechanism(s) that are critical for optimal performance vary based on the incremental training paradigm and type of data being used, but they all demonstrate that the catastrophic forgetting problem has yet to be solved.
  • Next Steps: Authors propose FearNet, their approach to the catastrophic forgetting problem.
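
A toy sketch of how these metrics might be computed from per-session accuracies, normalized by an ideal offline model (variable names and details are my own reading of the talk; see the paper for the exact definitions):

```python
def forgetting_metrics(base_acc, new_acc, all_acc, offline_acc):
    """base_acc / new_acc / all_acc: accuracies on the base task, the newest
    task, and all tasks so far, measured after each incremental session."""
    n = len(base_acc)
    stability = sum(a / offline_acc for a in base_acc) / n   # base retention
    plasticity = sum(new_acc) / n                            # new learning
    overall = sum(a / offline_acc for a in all_acc) / n      # combined view
    return stability, plasticity, overall

# Example: accuracy on the base task degrades as two new sessions are learned.
print(forgetting_metrics([0.60, 0.50], [0.90, 0.85], [0.70, 0.60],
                         offline_acc=0.95))
```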

My thoughts: I quite enjoyed this presentation, found it educative, and believe the metrics, experiments and insights they offer contribute to our understanding of the catastrophic forgetting problem. The paper also highlights some intractable qualities of current approaches to catastrophic forgetting (e.g. a sparsity approach that incurs a 40x memory cost) and why they may be unusable for real-world deployment. They also show that the performance of many of these approaches does not generalize well to larger datasets.

Wednesday Feb 7

[Invited Talk] How do we Evaluate ML for AI — Percy Liang, Stanford NLP Group

Screenshot from Percy’s slide on future directions in crafting test sets.

In this talk, Percy argues for improvements in how we evaluate ML algorithms, to ensure they encode meaningful information (no cheap tricks) and extrapolate well to new distributions.

The need for better metrics

  • ML has been successful. However, ML algorithms are susceptible to (adversarial) perturbations and make mistakes humans don’t make.
    — Reading comprehension is also susceptible to adversarial attacks (SQuAD dataset) and suffers the same issue. MSRnet.
    — These issues suggest we are doing something wrong.
    — How do we close the gap between what humans and machines can do?
    — Possible solutions: Geoffrey Hinton’s capsule networks? Consciousness priors by Yoshua Bengio? Probabilistic programming (Kolter and Wong 2017)?
    — Hector Levesque (2013): we need harder tests for ML/AI systems beyond cheat sheets.
    — Winograd schemas. Consider the example “the dog chased the cat up the tree. It waited at the bottom”. Does an algorithm understand what “it” refers to?
    — Interpolation is not sufficient.
    — Extrapolation is harder: the test domain differs from the training domain. We need algorithms that extrapolate across domains.

On Harder Tests and Adversarial Examples

  • [Adversarial Examples for Reading Comprehension — EMNLP 2017] In an adversarial setting, the accuracy of sixteen published models drops from an average of 75% F1 score to 36%; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to 7%.
  • Susceptibility to adversarial examples (e.g. a stop sign that looks normal to a human but can fool a network into a wrong classification) might suggest the network has learned the wrong thing in representing the stop sign.
  • Another example of better test benchmarks: A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories, ACL 2016 .
    This corpus is unique in two ways: (1) it captures a rich set of causal and temporal commonsense relations between daily events, and (2) it is a high quality collection of everyday life stories that can also be used for story generation. Experimental evaluation shows that a host of baselines and state-of-the-art models based on shallow language understanding struggle to achieve a high score on the Story Cloze Test.
  • CodaLab: Percy’s group has created CodaLab, a tool that curates models and fosters reproducible computational research. This facilitates some of the benchmarking and evaluation experiments they conduct.

On Defending Against Adversarial Attacks
Percy offers an approach — Certified Defenses for Data Poisoning Attacks

  • To protect against attacks, we can attempt to create an upper-bound guarantee on the performance of a model: a notion of the worst-case loss of a defense across a broad family of attacks, for defenders that first perform outlier removal followed by empirical risk minimization.
  • Create a certificate that captures this bound; adding it to the training objective function allows a better defense.

Evaluation of Learning

  • Can a model transform a negative review to a positive review? This might mean it has learned something meaningful about the text.
  • Empirical evaluations continue to be a good approach. Aligned with tradition of Turing test and Winograd schema.
  • If training and test set have same distribution, we cannot be exactly sure that our network has learned something deep.
  • When a measure becomes a target, it ceases to be a good measure (Goodhart’s Law).

[Technical Talk] Sentence Ordering and Coherence Modeling using Recurrent Neural Networks [paper]

Model Overview: The input set of sentences are represented as vectors using a sentence encoder. At each time step of the model, attention weights are computed for the sentence embeddings based on the current hidden state. The encoder uses the attention probabilities to compute the input for the next time-step and the decoder uses them for prediction. Source

Motivation

  • How can we order text in a coherent manner? Coherence refers to properties of text that allow readers to understand and absorb information; its absence restricts understanding.
  • Coherence Modeling: The task of distinguishing between coherent and incoherent text. If we can model coherence, then we can perform the valuable task of sentence ordering.
  • Case example: in multi-document summarization, we want to know how to assemble text extracted from each of several documents to generate a coherent summary.

Contribution

  • The authors propose an end-to-end unsupervised deep learning approach based on the set-to-sequence framework.
  • Visualizing the learned sentence representations shows that the model captures high-level logical structure in paragraphs.

My thoughts: Provides some interesting ideas on approaches for modeling coherence. There may be applications for this as an automated discriminator in learning a GAN that generates coherent stories, conversation or smalltalk.

[Technical Talk] How Images Inspire Poems: Generating Classical Chinese Poetry from Images with Memory Networks

Motivation

  • Classical Chinese poetry has a concise structure, rhythmic beauty and rich content. The most common genre of Chinese poetry is the quatrain (a complete poem consisting of 4 lines).
  • Goal: given an image, how can we generate a quatrain that both follows the rules of a quatrain and describes the image?
  • Existing work has several limitations: topic drift (the system generates material unrelated to the image), failure to cover the image, and failure to cover keywords.

Contribution

  • The authors offer a pipeline that first extracts keywords and semantic features, generates poems line by line, integrates semantic keywords, and ignores unimportant information.
  • An attention layer allows the model to focus on important parts of the image and on content in preceding lines. It supports an unlimited number of keywords.
  • They develop some interesting evaluation experiments to validate their work.

[Technical Talk] A Knowledge-Grounded Neural Conversation Model [Paper]

Knowledge-grounded model architecture and evaluation metrics. Source

Motivation

  • Neural network models are capable of generating extremely natural sounding conversational interactions. Nevertheless, these models have yet to demonstrate that they can incorporate content in the form of factual information or entity-grounded opinion that would enable them to serve in more task-oriented conversational applications. If a user mentions an entity not in the dataset, there is no way to handle it correctly.
  • Goal: An end-to-end trainable pipeline, fully data driven that learns to converse using grounding information from a dataset.

Contribution

  • Presents a novel, fully data-driven, and knowledge-grounded neural conversation model aimed at producing more contentful responses without slot filling.
  • Generalizes the widely-used sequence-to-sequence (Seq2Seq) approach by conditioning responses on both conversation history and external “facts”, allowing the model to be versatile and applicable in an open-domain setting. Infuses non-conversational data into the conversation (a toy sketch follows this list).
  • The approach yields significant improvements over a competitive Seq2Seq baseline. Human judges found the model’s outputs significantly more informative.
  • Experiment: data gathered from Foursquare plus 23M general-domain Twitter conversations; grounding via 1M Twitter conversations containing entities from Foursquare.
  • Multi-task learning and parameter sharing to address generalization problems.
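
A toy PyTorch sketch of the grounding idea (my own simplification, not the paper’s architecture): encode the conversation history and the retrieved facts separately, fuse the two encodings, and condition the response decoder on the fused state.

```python
import torch
import torch.nn as nn

class GroundedSeq2Seq(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.history_enc = nn.GRU(dim, dim, batch_first=True)
        self.fact_enc = nn.GRU(dim, dim, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, history, facts, response_in):
        _, h_hist = self.history_enc(self.embed(history))
        _, h_fact = self.fact_enc(self.embed(facts))
        # Initialize the decoder with a fusion of dialog context and facts.
        h0 = torch.tanh(self.merge(torch.cat([h_hist, h_fact], dim=-1)))
        dec_out, _ = self.decoder(self.embed(response_in), h0)
        return self.out(dec_out)              # (batch, resp_len, vocab)

model = GroundedSeq2Seq()
hist = torch.randint(0, 1000, (2, 12))       # token ids: conversation history
facts = torch.randint(0, 1000, (2, 30))      # token ids: retrieved facts
resp = torch.randint(0, 1000, (2, 8))        # teacher-forced response input
print(model(hist, facts, resp).shape)        # torch.Size([2, 8, 1000])
```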

[Invited Talk] From Naive Physics to Connotation: Learning and Reasoning about the World Using Language — Yejin Choi

One of the more interesting talks, in which Yejin offers a rich perspective on her work in Natural Language Processing. She starts with the work her team did in winning the Amazon Alexa Prize: creating a socialbot that held natural conversation over 10 minutes.

Teams selected for the Alexa Prize compete in creating socialbots that can converse coherently and engagingly with humans on a range of current events and popular topics such as entertainment, sports, politics, technology, and fashion.

To get a feel for how satisfying a well-designed socialbot conversation can be, I suggest a look at the video on the results of the 2017 Alexa Prize. The bots can be interrupted, generate conversation across multiple domains, appear to keep context, etc.

The slides above are a close version of the talk Yejin gave at AAAI.

Our Solution Outperforms SOTA!

One interesting thing that happened without fail at ALL technical talks was the part where each presenter made their claim to victory with the line:

“Our Solution Outperforms the State of the Art (SOTA)” (accompanied by a table to this effect).

Research in AI moves really, really fast, and I guess one objective way for researchers to demonstrate progress is to show metrics where they report performance scores that beat previous SOTA benchmarks (sometimes by 0.1 or 0.2). Some have observed how these comparisons can be manipulated and create flaws in the publication process.

Another interesting buzzword was “end-to-end trainable”, referring to ML pipelines that train all models using a single/joint loss function. It’s both desirable and fashionable to have an end-to-end trainable model, which of course beats SOTA!

Conclusion

AAAI 2018 was a really well organized conference with a great venue (New Orleans was lit), and I learned a lot! More importantly, it helped me identify areas and concepts within AI that I need to invest more time in understanding. It was also great to present my work with colleagues on a Cognitive Assistant for Analyzing and Visualizing Exoplanet Data. A video description of this project can be found here.

Thanks for reading (turned out to be a lengthy post!). Feel free to reach out on Twitter, Github or Linkedin.
