In this article, I present a total of 84 papers and articles published in 2020 that I found particularly interesting. For the sake of clarity, I divide them into 12 sections. My personal summary for 2020 is as follows.
– – – – – My personal summary for 2020 – – – – –
In 2020, the Transformer model made a huge leap forward. In natural language processing, GPT-3, a large-scale Transformer, achieved high accuracy on many tasks. In image classification, a Transformer trained with a large amount of data and a large number of parameters surpassed Big Transfer, which previously had the highest accuracy. Fractal image datasets, which are free of discriminatory elements and copyright issues, may become very important as the ethics of AI grows in importance; people in industry who do not have access to ImageNet will also welcome such a dataset. Many publications on self-supervised learning without labels now rival the accuracy of supervised learning. Deep fakes have become a social problem, and a detection method using biometric signals has been proposed; the same technology is also being used in a positive way to protect the privacy of victims. Numerical simulation combined with machine learning is emerging: by learning input/output patterns, simulations can be made dramatically faster, which may lead to greater use by companies.
– – – – – – – – – – – – – – – – – – – – – – – – –
Topics
1. Image/video classification tasks
2. Unsupervised learning / self-supervised learning
3. Natural language processing
4. Sparse model / model compression / inference speedup
5. Optimization / loss function / data augmentations
6. Deep fake
7. Generative models
8. Machine learning with natural sciences
9. Analysis of deep learning
10. Other research
11. Real world applications
12. Technical articles
– – – – – – – – – – – – – – – – – – – – – – – – –
1. Image/video classification tasks
– – – – – – – – – – – – – – – – – – – – – – – – –
_The Transformer model has finally made a breakthrough in image classification, surpassing CNN-based models to achieve the highest accuracy on ImageNet. However, since it requires a dataset on the scale of JFT-300M (300 million images) and more than 600 million parameters (about 10 times more than EfficientNet-B7), it is not yet easy to use. In 2021, there may be research that surpasses CNN-based models with a lightweight Transformer. With fractal image datasets, you don’t have to worry about copyright or discriminatory factors. This is also great for people in industry who don’t have easy access to ImageNet._
Improves accuracy by correcting the misalignment of resolution during training and inference

Fixing the train-test resolution discrepancy: FixEfficientNet https://arxiv.org/abs/2003.08237
EfficientNet improves accuracy by increasing the resolution, but there is a gap between the resolutions used at training and at inference. By fine-tuning the top layers at the target resolution after training, they close this gap and achieve better results on ImageNet than NoisyStudent without using external data (SotA).
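The recipe is simple enough to sketch. Below is a minimal PyTorch illustration of the general idea (my own sketch, not the paper’s code): freeze the backbone of a classifier trained at a lower resolution and briefly fine-tune only its final layer on images resized to the test resolution. The model, resolutions, and layer names are placeholders.

```python
import torch
import torch.nn as nn
import torchvision

# Placeholder classifier, assumed to have been trained at 224x224.
model = torchvision.models.resnet50()

# Freeze everything except the final classifier layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Briefly fine-tune on images resized to the *test* resolution (here 320 instead of 224).
images = torch.rand(8, 3, 320, 320)          # stand-in for resized training images
labels = torch.randint(0, 1000, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```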
Explore the structural space of the network, which gives good accuracy

Designing Network Design Spaces https://arxiv.org/abs/2003.13678
Proposes a method for optimizing network structure by considering a design space of network parameters and narrowing it down to a good set through experiments, with concrete evaluation methods and results. It is not fully automated like NAS, but closer to the manual design of a skilled engineer. The resulting RegNet is five times faster than EfficientNet with better accuracy.
U-net in U-net

U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection https://arxiv.org/abs/2005.09007
To make effective use of information at various resolutions in salient object detection, they propose U²-Net, a U-Net whose blocks are themselves U-Nets. The results are more efficient and accurate than previous studies, even without pre-training.
Improved accuracy by colorizing each object

Instance-aware Image Colorization https://arxiv.org/abs/2005.10825
For image colorization, objects are cropped out with a trained detector and colorized individually, based on the view that it is easier to colorize individual objects than to colorize the whole image directly. Each object can then be colored without being influenced by the background.
Anomaly detection using trained models

Modeling the Distribution of Normal Data in Pre-Trained Deep Features for Anomaly Detection https://arxiv.org/abs/2005.14140
A study of anomaly detection using a pretrained model. They fit multivariate Gaussian distributions to features from multiple hidden layers and detect anomalies using the Mahalanobis distance. They achieved SotA on the MVTec dataset, which is rather close to real application environments.
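As a concrete illustration of the mechanism (my own sketch, not the authors’ code), the core of the method fits in a few lines of NumPy: fit a Gaussian to pretrained-network features of normal images and score test images by Mahalanobis distance. Feature extraction and the choice of layer are assumed to happen elsewhere.

```python
import numpy as np

def fit_gaussian(train_feats):
    """Fit a multivariate Gaussian to features of normal (defect-free) images.

    train_feats: (N, D) array of pretrained-network features, e.g. pooled
    activations from one hidden layer. Returns mean and inverse covariance.
    """
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    cov += 1e-3 * np.eye(cov.shape[0])   # regularize so it stays invertible
    return mu, np.linalg.inv(cov)

def mahalanobis_score(test_feats, mu, cov_inv):
    """Anomaly score = Mahalanobis distance from the fitted normal distribution."""
    diff = test_feats - mu
    return np.sqrt(np.einsum("nd,dk,nk->n", diff, cov_inv, diff))

# Toy usage with random stand-in features (replace with real network features).
rng = np.random.default_rng(0)
normal_feats = rng.normal(size=(500, 64))
test_feats = rng.normal(size=(10, 64)) + 3.0   # shifted -> should score high
mu, cov_inv = fit_gaussian(normal_feats)
print(mahalanobis_score(test_feats, mu, cov_inv))
```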
Supporting minor categories by grouping classes by data count and adding an "others" class to each group

Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax https://arxiv.org/abs/2006.10408
In object detection, performance is poor on minor categories. Methods designed for classification on such long-tailed datasets are incompatible with object detection and do not improve accuracy much. They therefore propose creating groups of classes according to the number of examples and adding an "others" class within each group. The proposed method improves the accuracy of minor-category classification.
Machine learning models can detect left-right flips

Visual Chirality https://arxiv.org/abs/2006.09512
The left-right flipping used in data augmentation assumes that the data distribution does not change when images are flipped. In fact, the distribution changes slightly: most people wear watches on their left wrist, so after flipping, the watch appears on the right wrist, which differs from the original data. Using deep learning, they were able to determine whether an image had been flipped. This kind of analysis could become a new tool for finding the image properties whose distribution changes under flipping.
Varying the image size improves accuracy

MIND THE PAD – CNNS CAN DEVELOP BLIND SPOTS https://arxiv.org/abs/2010.02178
The study shows that padding causes a positional dependence of accuracy. For networks like ResNet that downsample with stride 2, padding pixels are not used equally, depending on the image size (e.g., the leftmost padding column is read while the rightmost one is never touched). Simply changing the image size so that the padding is treated equally improves accuracy.
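The effect is easy to reproduce with a few lines of plain Python (an illustration under my own assumptions of a stride-2, kernel-3, padding-1 convolution, not the paper’s code): with an even input width the rightmost padding column is never read, while an odd width uses both sides.

```python
def padded_columns_used(width, kernel=3, stride=2, pad=1):
    """Return which padded column indices are ever covered by a conv window."""
    padded = width + 2 * pad                  # indices 0 .. padded-1
    pad_cols = {0, padded - 1}                # leftmost / rightmost padding column
    used = set()
    start = 0
    while start + kernel <= padded:           # slide the kernel with the given stride
        used.update(range(start, start + kernel))
        start += stride
    return sorted(pad_cols & used)

print(padded_columns_used(width=8))   # [0]     -> the right padding is never seen
print(padded_columns_used(width=9))   # [0, 10] -> both sides contribute
```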
Pre-training on data containing many categories makes one-shot detection stronger

Closing the Generalization Gap in One-Shot Object Detection https://arxiv.org/abs/2011.04267
In object detection, the study shows that a model trained on a larger number of categories performs better at few-shot learning. They suggest that when creating a new dataset, it is better to cover many categories than to collect many examples for each individual category.
LambdaNetworks: faster and more accurate than EfficientNet

LAMBDANETWORKS: MODELING LONG-RANGE INTERACTIONS WITHOUT ATTENTION https://openreview.net/forum?id=xTJEN-ggl1b
When applying self-attention to an image, the latent representation for each pixel is obtained from a matrix product of attention weights and queries. Instead, they propose the Lambda Layer, which first abstracts the context into a fixed linear map built from key/value products, independent of position, and then applies it to each query to obtain the latent representation. The result is faster and more accurate than EfficientNet.
Beyond the CNN-based model with Transformer

An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale https://openreview.net/forum?id=YicbFdNTTy
The Transformer does not work well when trained on a medium-sized dataset such as ImageNet, because it has less inductive bias than CNNs, so they adopt a strategy of pre-training on the large JFT-300M dataset. They split images into patches and process each patch like a word in a document, with results beyond or comparable to those of BiT-L and Noisy Student.
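A minimal PyTorch sketch of the patch-embedding idea (sizes follow common ViT-Base-style defaults and are illustrative, not a faithful reproduction of the paper): split the image into 16×16 patches, project each patch linearly, and prepend a class token plus positional embeddings before a standard Transformer encoder.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch tokens, ViT-style (illustrative sizes)."""

    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # A conv with kernel = stride = patch size is equivalent to
        # "split into patches, flatten, apply a shared linear projection".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # (B, 1, dim)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768]) -> 196 patches + 1 class token
```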
Pre-training with a dataset that is free of copyright and discrimination concerns

Pre-training without Natural Images http://hirokatsukataoka.net/pdf/accv20_kataoka_fractaldb.pdf
A study of pre-training using fractal images. Unlike natural images, the images are synthesized from a mathematical function, so the dataset is free of copyright and discrimination concerns. In some cases it achieved accuracy exceeding ImageNet pre-trained models.
More accurate and faster object detection model than EfficientDet-D7.

Scaled-YOLOv4: Scaling Cross Stage Partial Network https://arxiv.org/abs/2011.08036
YOLOv4 significantly modified YOLOv3 with recent techniques such as Cross-Stage-Partial connections, Spatial Pyramid Pooling, and Path Aggregation Networks, making it faster and more accurate than EfficientDet. Scaled-YOLOv4 builds on YOLOv4 by exploring the trade-offs between resolution, width, and depth, achieving accuracy and speed exceeding EfficientDet-D7.
– – – – – – – – – – – – – – – – – – – – – – – – –
2.Unsupervised learning / self-supervised learning
– – – – – – – – – – – – – – – – – – – – – – – – –
In self-supervised learning, which uses no label information, many methods using two networks, such as SimCLR, have been proposed, and many of them achieve accuracy comparable to supervised learning. Google and Facebook, which are likely to have large amounts of data, have a dominant presence in this area.
FixMatch: a semi-supervised learning method that can learn even with only one labeled example per class

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence https://arxiv.org/abs/2001.07685
They propose FixMatch, a semi-supervised learning method that applies a cross-entropy loss between one-hot pseudo-labels obtained from weakly augmented unlabeled images and the predictions on strongly augmented versions of the same images. It can learn even with only one labeled example per class.
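A simplified PyTorch sketch of the unlabeled-data loss as I understand it (the threshold, augmentations, and model are placeholders): pseudo-label from the weakly augmented view, keep only confident predictions, and apply cross-entropy against the strongly augmented view.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_imgs, strong_imgs, threshold=0.95):
    """Pseudo-labeling loss on unlabeled images (simplified FixMatch-style sketch).

    weak_imgs / strong_imgs: the same unlabeled batch under weak and strong
    augmentation. Only confident pseudo-labels contribute to the loss.
    """
    with torch.no_grad():
        probs = F.softmax(model(weak_imgs), dim=1)      # predictions on the weak view
        conf, pseudo = probs.max(dim=1)                 # one-hot pseudo-labels
        mask = (conf >= threshold).float()              # keep confident ones only
    logits_strong = model(strong_imgs)                  # predictions on the strong view
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```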
SimCLR, a highly accurate representation learning method with simple mechanisms

A Simple Framework for Contrastive Learning of Visual Representations https://arxiv.org/abs/2002.05709
Prof. G. Hinton’s group’s work on unsupervised representation learning. They achieved scores competitive with supervised learning using three strategies: strong data augmentation, a larger network, and taking the loss after an additional nonlinear transformation (a projection head).
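For reference, the contrastive (NT-Xent) loss at the heart of this line of work can be sketched as follows; this is my own simplified version, assuming `z1` and `z2` are projection-head outputs for two augmented views of the same batch.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss: each view should be closest to its positive pair.

    z1, z2: (B, D) projection-head outputs for two augmentations of the same images.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # (2B, D), cosine space
    sim = z @ z.t() / temperature                          # (2B, 2B) similarities
    sim.fill_diagonal_(float("-inf"))                      # exclude self-similarity
    B = z1.shape[0]
    # The positive for sample i is the other view of the same image.
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```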
Multi-tasking, multi-modal video representation learning

Evolving Losses for Unsupervised Video Representation Learning https://arxiv.org/abs/2002.12177
They propose a mechanism for multi-task, multimodal unsupervised learning that distills the aggregated information into a network taking RGB video inputs, balancing the tasks with an evolutionary algorithm, and achieve very high accuracy on an action detection task using 2 million videos.
Self-supervised learning on par with or better than supervised learning

Big Self-Supervised Models are Strong Semi-Supervised Learners https://arxiv.org/abs/2006.10029
The authors propose SimCLRv2, which uses only a few labels and performs as well as or better than supervised learning. It consists of three stages: unsupervised pre-training, fine-tuning, and self-distillation using unlabeled data. In general, larger models are stronger.
Unsupervised representation learning for action

Aligning Videos in Space and Time https://arxiv.org/abs/2007.04515
A study on unsupervised detection of action correspondences in video. It learns patch-level correspondences by measuring distances between patches found by trackers trained without supervision. The detection accuracy is better than that of existing feature extractors.
Improving Contrastive Learning with Adversarial Perturbations

VIEWMAKER NETWORKS: LEARNING VIEWS FOR UNSUPERVISED REPRESENTATION LEARNING https://arxiv.org/abs/2010.07432
Contrastive learning learns representations that are invariant to data augmentations, but the augmentations are usually chosen by experts. Instead, the authors propose a viewmaker network that applies adversarial perturbations to the inputs. It can be applied regardless of domain, and gives large improvements on speech and on time-series data from wearable devices.
Self-supervised learning that incorporates the idea of causal graphs

REPRESENTATION LEARNING VIA INVARIANT CAUSAL MECHANISMS https://arxiv.org/abs/2010.07922
Considering that images are generated by a causal graph of content (e.g., animal species) and style (background, etc.), they propose ReLIC, a self-supervised learning method that learns representations invariant to style. Specifically, it classifies individual images and matches the prediction distributions across data augmentations so that the representation is invariant to style changes. It was not only comparable to previous studies but also effective in reinforcement learning.
A combination of semi-supervised learning and contrastive learning

FROST: Faster and more Robust One-shot Semi-supervised Training https://arxiv.org/abs/2011.09471
By incorporating a contrastive-learning loss into semi-supervised learning, they propose FROST, which makes training faster; it reaches accuracy comparable to supervised learning in only about 128 epochs and is robust to hyperparameter choices.
– – – – – – – – – – – – – – – – – – – – – – – – –
3.Natural language processing
– – – – – – – – – – – – – – – – – – – – – – – – –
In natural language processing, much research has been proposed to improve the Transformer, mostly focusing on computational efficiency. GPT-3 (https://arxiv.org/abs/2005.14165), with its 175 billion parameters, performed well on many tasks, but it also reproduced unfair biases in its training data and was unable to reason about physical situations, such as whether cheese melts in a refrigerator.
Is the dot-product of Self-Attention mandatory?

SYNTHESIZER: Rethinking Self-Attention in Transformer Models https://arxiv.org/abs/2005.00743
This study revisits the self-attention in the Transformer. Self-attention uses dot products to model interactions between tokens. Here, they instead compute the attention weights from each token independently, or treat the attention weights as learned parameters, and show that performance remains competitive.
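My own toy sketch of the two variants as I read them, with illustrative names and sizes: a "dense" variant predicts the attention weights from each token alone, and a "random" variant stores them as plain learned parameters; neither uses a query-key dot product.

```python
import torch
import torch.nn as nn

class DenseSynthesizerAttention(nn.Module):
    """Attention weights predicted per token, with no query-key dot product."""

    def __init__(self, dim, max_len):
        super().__init__()
        self.to_weights = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                        nn.Linear(dim, max_len))
        self.value = nn.Linear(dim, dim)

    def forward(self, x):                                      # x: (B, L, dim), L <= max_len
        L = x.shape[1]
        attn = self.to_weights(x)[:, :, :L].softmax(dim=-1)    # (B, L, L)
        return attn @ self.value(x)

class RandomSynthesizerAttention(nn.Module):
    """Attention weights are a learned parameter matrix shared across all inputs."""

    def __init__(self, dim, max_len):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(max_len, max_len))
        self.value = nn.Linear(dim, dim)

    def forward(self, x):
        L = x.shape[1]
        attn = self.weights[:L, :L].softmax(dim=-1)             # (L, L)
        return attn @ self.value(x)

out = RandomSynthesizerAttention(64, 128)(torch.randn(2, 10, 64))   # (2, 10, 64)
```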
Very Large Scale Language Model GPT-3

Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165
Proposes GPT-3, an autoregressive language model with as many as 175 billion parameters. It shows very high performance on many tasks. On the other hand, some problems were found, such as reproducing unfair biases in the data, being unable to reason about physical situations (e.g., "Does cheese melt in the refrigerator?"), and struggling with some question-answering tasks.
Improve efficiency by limiting where you take your Attention

Big Bird: Transformers for Longer Sequences https://arxiv.org/abs/2007.14062
A study that makes self-attention more efficient by combining three types of attention: random attention, local attention over neighboring tokens, and global attention through a small number of tokens that attend to everything. They show it is SotA on many NLP tasks and, theoretically, that it is a universal approximator of sequence-to-sequence functions and is Turing complete.
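To make the attention pattern concrete, here is an illustrative NumPy sketch (my own, not the authors’ implementation) that builds this kind of sparse mask: a local window per token, a few global tokens that attend to and are attended by everything, and some random connections.

```python
import numpy as np

def sparse_attention_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Boolean (seq_len, seq_len) mask: True where attention is allowed."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True                                             # local window
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True  # random links
    mask[:num_global, :] = True                                           # global tokens attend to all
    mask[:, :num_global] = True                                           # and are attended by all
    return mask

m = sparse_attention_mask(16)
print(m.sum(), "allowed pairs out of", m.size)
```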
A survey of efficient transformers

Efficient Transformers: A Survey https://arxiv.org/abs/2009.06732
A survey paper on the family of improvements to the Transformer model, which has spread rapidly in recent years, especially in natural language processing. The methods are organized in terms of memory usage, attention patterns, and so on. The figures and discussion give a good overview of the field.
Combining tokens with related images for pre-training

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision https://arxiv.org/abs/2010.06775
Inspired by the fact that humans use not only textual but also visual information when learning language, they propose a learning method using a "vokenizer" that assigns related images to tokens. Better representations are obtained by learning from the related images at the same time as language-model pre-training.
mT5: a multilingual version of T5

mT5: A massively multilingual pre-trained text-to-text transformer https://arxiv.org/abs/2010.11934
They propose mT5, a multilingual version of T5, which casts all tasks in a text-to-text format and follows a pre-train-then-fine-tune strategy, together with mC4, a large multilingual dataset covering 101 languages. With up to 13 billion parameters, it achieves the highest performance on various tasks.
Pre-training with data obtained in daily medical work

CONTRASTIVE LEARNING OF MEDICAL VISUAL REPRESENTATIONS FROM PAIRED IMAGES AND TEXT https://arxiv.org/abs/2010.00747
This is a study of representation learning by contrastive learning on pairs of medical images and text, the kind of data routinely produced in medical practice. The method pulls the representations of paired text and images closer together; it is more useful than ImageNet-trained models and greatly improves the accuracy of image retrieval.
– – – – – – – – – – – – – – – – – – – – – – – – –
4.Sparse model / Model compression / inference speedup
– – – – – – – – – – – – – – – – – – – – – – – – –
Recently, with the advent of Big Transfer and GPT-3, models have tended to become larger, so speeding up and shrinking models is a major topic in practical applications. For example, there is research that reduces model size to 1/47th of the original, and a compression method using the wavelet transform has also been introduced.
Large models should be used when computational resources are low

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers https://arxiv.org/abs/2002.11794
When computational resources are limited, it is suggested that it is better to train a large model and then compress it than to train and infer with a small model. Larger models converge faster and lose less accuracy when compressed.
Compress existing models of GAN down to 1/47

GAN Slimming: All-in-One GAN Compression by A Unified Optimization Framework https://arxiv.org/abs/2008.11062
Research to reduce the size of the GAN generator. A combination of distillation, quantization, and pruning. Quantization is usually non-differentiable, but they use pseudo-gradient to enable E2E training. They have successfully reduced the size of an existing model to 1/47th of its original size.
Compressing GANs using the Wavelet Transform

not-so-BigGAN: Generating High-Fidelity Images on a Small Compute Budget https://arxiv.org/abs/2009.04433
This research generates high-fidelity images with GANs on a small computational budget by keeping only the low-frequency information from a wavelet transform (WT): the discarded high-frequency information is reconstructed by a neural network and the inverse wavelet transform (iWT) is applied. Although the quality of the generated images decreases slightly, they succeed in reducing the required computational resources from TPU×256 to GPU×4.
Achieve high accuracy and low latency with a pruning pattern that divides each layer into blocks

YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design https://arxiv.org/abs/2009.05697
To achieve both high accuracy and low latency on mobile devices, they propose a block-punched pruning method that divides each layer into blocks and learns a different pruning pattern in each block, increasing speed while maintaining accuracy. The GPU computes the convolutional layers while the CPU computes the other layers, further improving performance.
Quantize the weights by permuting them to make them easier to compress

Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks https://arxiv.org/abs/2010.15703
A study of quantization that permutes the weights to make them more compressible. It divides the permuted weights into blocks and replaces each block with its nearest neighbor in a stored codebook of vectors rather than storing the weights directly. ResNet-50 can be compressed to 1/31st of its original size while maintaining accuracy.
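The codebook step alone is easy to sketch (my own toy NumPy/scikit-learn illustration that ignores the permutation search and fine-tuning): split a weight matrix into small blocks, learn a codebook with k-means, and store each block as the index of its nearest codeword.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_weights(weight, block=4, codebook_size=256, seed=0):
    """Replace weight blocks by their nearest codewords from a learned codebook.

    weight: (out, in) matrix with out*in divisible by `block`. Returns the codebook,
    per-block indices, and the reconstructed (compressed) weight matrix.
    """
    blocks = weight.reshape(-1, block)                       # (num_blocks, block)
    km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed).fit(blocks)
    codebook, idx = km.cluster_centers_, km.labels_          # nearest-codeword indices
    reconstructed = codebook[idx].reshape(weight.shape)
    return codebook, idx, reconstructed

w = np.random.randn(512, 256).astype(np.float32)
codebook, idx, w_hat = quantize_weights(w)
print("mean reconstruction error:", np.abs(w - w_hat).mean())
```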
– – – – – – – – – – – – – – – – – – – – – – – – –
5.Optimization/ loss function/ data augmentations
– – – – – – – – – – – – – – – – – – – – – – – – –
Several new activation functions with learnable parameters have been presented; the intention seems to be to let the network adaptively decide whether to activate. The problem of mislabeled data is also a major topic in practice, and some research addresses it by exploiting differences in learning speed or by smoothing the loss function.
Data augmentation techniques performed in hidden layers

PatchUp: A Regularization Technique for Convolutional Neural Networks https://arxiv.org/abs/2006.07794
This paper proposes PatchUp, a regularization method that applies a CutMix-like operation (https://arxiv.org/abs/1905.04899) in the hidden layers. It achieves quite competitive results among regularization methods.
Taking advantage of the difference in learning speed between right and wrong labels

Early-Learning Regularization Prevents Memorization of Noisy Labels https://arxiv.org/abs/2007.00151
They find that under label noise, correctly labeled data is learned correctly, while for mislabeled data the model at first predicts the correct labels but is later pulled toward the wrong labels and memorizes them. Exploiting this phenomenon, they propose ELR, a regularization method that uses a moving average of the model's outputs as a target. The results are very effective in the presence of label noise.
Improving visualization by constraining each filter to be responsible for one category

Training Interpretable Convolutional Neural Networks by Differentiating Class-specific Filters https://arxiv.org/abs/2007.08194
A study that improves interpretability by constraining CNN filters so that each filter is responsible for one category. By multiplying the filters in the final layer by a trainable matrix with values in [0, 1], they ensure that each filter is used by only one category. Classification performance is not compromised, and the CAM visualizations are better.
Activation function to determine what to activate

Activate or Not: Learning Customized Activation https://arxiv.org/abs/2009.04759
They treat ReLU and Swish in a unified manner and propose the activation function ACON (ACtivate Or Not) as its general form. The activation function has several learnable parameters that let it freely switch between activating and not activating. They confirm that it improves the accuracy of object detection and image retrieval.
Prevent the loss function from falling into a sharp minimum

Sharpness-Aware Minimization for Efficiently Improving Generalization https://arxiv.org/abs/2010.01412
To prevent the loss function from converging to a sharp minimum, they propose SAM, which optimizes by adding the perturbation to the model parameters that increases the loss the most within a neighborhood. Generalization performance improves, and the method is robust to label noise.
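A minimal sketch of the two-step update as described (a simplified version I wrote for illustration, not the released implementation; it assumes every parameter receives a gradient): ascend to the worst-case nearby weights, take the gradient there, then update the original weights.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One SAM update: perturb weights toward higher loss, then descend from there."""
    # 1) gradient at the current weights
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.clone() for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12

    # 2) move to the worst-case nearby point  w + rho * g / ||g||
    with torch.no_grad():
        eps = [rho * g / grad_norm for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)

    # 3) gradient at the perturbed point, then undo the perturbation and step
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```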
– – – – – – – – – – – – – – – – – – – – – – – – –
6.Deep fake
– – – – – – – – – – – – – – – – – – – – – – – – –
Deep fakes are becoming a major social issue, with the discovery of social media accounts that use deep-faked faces to make political statements. There was a deep-fake detection competition on Kaggle. Detection methods are also evolving, with proposals to use biometric signals. The technology is also being used not for antisocial purposes but to add realism to footage while protecting the privacy of victims.
Detect Deep Fake using heartbeat

DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms https://arxiv.org/abs/2006.07634
This research detects deep fakes by detecting the changes in skin color caused by the heartbeat. Current generative models can produce realistic video but cannot reproduce the subtle rhythmic changes in skin color caused by blood flow.
Methods used in the Deepfake detection competition

Detecting Deepfake Videos: An Analysis of Three Techniques https://arxiv.org/abs/2007.08517
An introduction to methods used in Kaggle’s deepfake detection competition. Accuracy was improved by incorporating features such as the number of blinks and differences in histograms into the model.
Detecting Deep Fake by utilizing biological signals

How Do the Hearts of Deep Fakes Beat? Deep Fake Source Detection via Interpreting Residuals with Biological Signals https://arxiv.org/abs/2008.11363
Deep fakes created with deep generative models have become a social problem. The authors achieved 97.29% accuracy in deep-fake detection by using information from a biological signal called PPG. They also found that each generative model produces a unique fake PPG-like signal and were able to identify the source model with 93.39% accuracy.
Protecting Victims with Deep Fake
In a documentary about people fleeing persecution, deepfake technology is used to protect them by synthesizing their faces. Simply blurring faces and using synthetic voices would lose realism and intimacy; using deepfakes protects the subjects' privacy while preserving the realism.
– – – – – – – – – – – – – – – – – – – – – – – – –
7.Generative models
– – – – – – – – – – – – – – – – – – – – – – – – –
For generative models, rather than focusing on higher resolution or higher quality of GAN-generated images, much of the research focused on interpretability of the latent space, generation from less data, and video-editing applications. Many methods using normalizing flows, which allow the inverse transformation from image to latent space, were also proposed, so we may be able to do many things with the latent space.
Achieve both quality of images and orderliness of latent space by learning

High-Fidelity Synthesis with Disentangled Representation https://arxiv.org/abs/2001.04296
ID-GAN achieves both image quality and an orderly latent space by first training an encoder to obtain a disentangled representation with a method such as β-VAE, and then using it to generate images with a GAN.
Making use of latent space for higher resolutions

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models https://arxiv.org/abs/2003.03808
In super-resolution, instead of training with a loss that maps a low-resolution image to a high-resolution one, they search the latent space of a pre-trained generative model for a high-resolution image that reproduces the input when downscaled back to low resolution. The method does not require training a large network and is not tied to a particular network choice.
Transfer the structure while keeping the appearance unchanged between the two images

Structural-analogy from a Single Image Pair https://arxiv.org/abs/2004.02222
A GAN that can transfer structure between two images without changing their appearance. As in SinGAN, the generator fills in details while upsampling the image, and the method can also be used for style transfer. It can be applied to video as well, which is great.
Combine GAN and AutoEncoder to get both organized latent space and high quality images

Adversarial Latent Autoencoders https://arxiv.org/abs/2004.04467
By splitting the encoder and decoder each into two parts and learning the distribution of the intermediate representation adversarially, as in a GAN, the authors propose an autoencoder that achieves both SotA-GAN-level expressiveness and an organized latent space. Very high-quality images are generated.
Get high quality images with less data by adaptively varying the application probability of data augmentation

Training Generative Adversarial Networks with Limited Data https://arxiv.org/abs/2006.06676
Data augmentation (together with consistency between augmented data in the discriminator) can be used to reduce overfitting. However, if the probability p of applying augmentation is too large, the generated data itself is affected by the augmentation. They therefore propose a GAN that prevents overfitting while avoiding this influence on generation by adjusting p adaptively, and succeed in generating high-quality images from only a few thousand training images.
Combining Markov processes and variational inference to achieve FID 3.17 on CIFAR-10

Denoising Diffusion Probabilistic Models https://arxiv.org/abs/2006.11239
A model that combines Markov processes and variational inference achieves FID 3.17 (SotA) on unconditional CIFAR-10. It is an autoencoder-like model, but each layer is treated as a step of a stochastic process, and the noise applied at each step is parameterized.
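The simplified training objective commonly presented for this family of models can be sketched as follows (my own illustration; the noise schedule is a placeholder and `model(x_t, t)` is assumed to return a noise estimate with the same shape as its input).

```python
import torch
import torch.nn.functional as F

def ddpm_training_loss(model, x0, T=1000):
    """Simplified denoising-diffusion loss: predict the noise added at a random step.

    x0: (B, C, H, W) clean images. model(x_t, t) is assumed to output a noise
    estimate with the same shape as x_t.
    """
    betas = torch.linspace(1e-4, 0.02, T)                # linear noise schedule (placeholder)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative signal fraction
    t = torch.randint(0, T, (x0.shape[0],))              # one random timestep per image
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise         # forward (noising) process
    return F.mse_loss(model(x_t, t), noise)              # learn to predict the noise
```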
High resolution image generation using normalizing flows.

SRFlow: Learning the Super-Resolution Space with Normalizing Flow https://arxiv.org/abs/2006.14200
Super-resolution using normalizing flows. Unlike GANs, which require balancing multiple loss terms, it is trained by optimizing only the log-likelihood. Because the transformations are invertible, there is a one-to-one correspondence with latent representations, so latent variables of high-resolution images can also be used for style manipulation.
Hierarchical VAE to generate high-definition images.

NVAE: A Deep Hierarchical Variational Autoencoder https://arxiv.org/abs/2007.03898
This is a study of a hierarchical VAE that generates high-definition images. Instead of computing the mean and variance directly at each stage, the distribution is designed relative to that of the previous layer. There are also various other innovations, such as Swish activations, cells with SE modules, spectral normalization, and depth-wise convolutions for a wide receptive field at low computational cost.
Get a disentangled representation with Hessian regularization

The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement https://arxiv.org/abs/2008.10599
A study on obtaining disentangled representations in the latent space of a GAN. They propose a regularization term using the Hessian so that a change along one latent direction i does not cause changes associated with another direction j (≠ i). It can also be applied by fine-tuning already-trained models.
Erasing People from Videos

Flow-edge Guided Video Completion https://arxiv.org/abs/2009.01835
This is a study of video completion using optical flow. The key points are: 1) completing the flow by detecting and connecting missing edges, 2) filling regions invisible in the current frame using temporally distant frames, and 3) preventing visible seams by working with gradients. The method is quantitatively better than previous studies and can even remove people from a video.
Change the timing, such as the start time and speed of movement, of people in a video

Layered Neural Rendering for Retiming People in Video https://arxiv.org/abs/2009.07833
This model lets you change the timing (start time and speed of movement) of the people in a video. First, each person, including occlusions, is separated from the background. Then the background and each person's information are encoded as features and recombined to generate the output. They can even retime effects that accompany a person's movements, such as water splashes.
Stabilizing the GAN in an ordinary differential equation solver

Training Generative Adversarial Networks by Solving Ordinary Differential Equations https://arxiv.org/abs/2010.15040
The instabilities in GAN training arise from integration error when discretizing the continuous dynamics, and they experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilize training. Even without spectral normalization, training was stable and produced excellent results.
Learn/generate high resolution from scratch with small amounts of data and small computational costs

TOWARDS FASTER AND STABILIZED GAN TRAINING FOR HIGH-FIDELITY FEW-SHOT IMAGE SYNTHESIS https://openreview.net/forum?id=1Fqg133qRaI
GANs that can train and generate at high resolution (256² to 1024²) from scratch with small amounts of data (100 to 1000 images) and modest computation (about one GPU for ten hours). The technical cruxes are an SLE module that combines information across resolutions and a constraint that the image can be reconstructed from the discriminator's intermediate feature maps.
Generate images using the reverse-time of stochastic differential equations

SCORE-BASED GENERATIVE MODELING THROUGH STOCHASTIC DIFFERENTIAL EQUATIONS https://openreview.net/pdf?id=PxTIG12RRHS
Unlike the usual generative models, which perturb data with a fixed set of noise levels, this work uses stochastic differential equations to describe a continuum over which the noise evolves in time, and generates images by following the reverse-time SDE. On CIFAR-10 it achieves IS 9.9 and FID 2.2, and it can also generate 1024×1024 images.
– – – – – – – – – – – – – – – – – – – – – – – – –
8.Machine learning with natural sciences
– – – – – – – – – – – – – – – – – – – – – – – – –
A lot of research combining natural science and Machine Learning has been published. There is a lot of research on speeding up and improving the accuracy of numerical simulations using machine learning, and results have been achieved in density functional theory and fluid simulation in physics.
Guide to the use of deep learning for researchers in the natural sciences

A Survey of Deep Learning for Scientific Discovery https://arxiv.org/abs/2003.11755
A guide to the use of deep learning for researchers in the natural sciences, with tutorials on a wide range of topics such as GANs, Transformers, and segmentation. It may be good for beginners who want to understand the whole picture.
Impose physical constraints by treating the Kohn-Sham equation as a differentiable model

Kohn-Sham equations as regularizer: building prior knowledge into machine-learned physics https://arxiv.org/abs/2009.08551
In the problem of approximating density-functional-theory simulations with neural networks, physical constraints can be imposed on the ML model by treating the Kohn-Sham equations as a differentiable model. This greatly improves the accuracy of the exchange-correlation term.
Speed up fluid simulation more than 1000 times

Fourier Neural Operator for Parametric Partial Differential Equations https://arxiv.org/abs/2010.08895
This is a study that replaces the computation of integral kernels in Fourier space with neural networks; when applied to fluid simulations such as the Navier-Stokes equations, it can be up to 1000 times faster than numerical solvers (FEM).
Simplify the model by using Cartesian coordinates

Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints https://arxiv.org/abs/2010.13581
For example, embedding the constrained Hamiltonian in Cartesian coordinates simplifies the equations and makes learning easier. Accuracy and data efficiency improve dramatically for complex systems such as the N-pendulum and the gyroscope.
Finding magnetic eruptions in space with an AI assistant
The phenomenon, caused by the solar wind colliding with the Earth's magnetosphere, is rare, so highly skilled scientists had been monitoring the data full-time. This article describes the trial and error of using machine learning to build an assistant for that task. Since the task is difficult, they improved accuracy with a "scientist in the loop" approach that keeps a scientist involved in the learning loop.
– – – – – – – – – – – – – – – – – – – – – – – – –
9.Analysis of deep learning
– – – – – – – – – – – – – – – – – – – – – – – – –
A lot of research has also been published on how deep learning behaves, for example showing that blocks with similar structures emerge, or identifying the role of each unit so that generated images can be manipulated. Personally, I find the study showing that models trained on adversarial samples are interpreted in a way similar to humans very suggestive: since adversarial training mainly uses high-frequency noise, the model may come to rely on low-frequency information, i.e., the shape of the object, which could lead to more human-like decisions.
With Batch Norm the early learning period is practically training in a shallow network

Batch Normalization Biases Deep Residual Networks Towards Shallow Paths https://arxiv.org/abs/2002.10444
With BatchNorm, the contribution of the shallower paths in a ResNet increases because the variance of each block's output is reduced, which makes training effective. The same effect can be obtained by introducing a coefficient that suppresses the variance in each block.
Models learned with adversarial noise make human-like decisions

Adversarially-Trained Deep Nets Transfer Better https://arxiv.org/abs/2007.05869
Compared to normally trained models, models trained with adversarial noise performed better in transfer learning. Visualizations show that they classify in a more human-like way, which may have influenced the results.
Whitening can negatively affect accuracy because it erases correlation

Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible https://arxiv.org/abs/2008.07545
Whitening eliminates per-category correlations in the data and makes it impossible to distinguish between noise and signal; second-order optimization can have the same effect, and both hurt generalization. However, training itself becomes faster, and with appropriate regularization second-order optimization can still generalize well, which may be useful when computational resources are the bottleneck.
Interpret the role of each unit in DNNs

Understanding the Role of Individual Units in a Deep Neural Network https://arxiv.org/abs/2009.05041
Research on interpreting the role of each unit in DNNs. They find units that learn concepts such as "tree" without those concepts being given explicitly. In a GAN, they succeed in removing trees from an image and attaching doors to buildings by manipulating the units that control each concept.
Scale laws in various data domains

Scaling Laws for Autoregressive Generative Modeling https://arxiv.org/abs/2010.14701
The study investigates scaling laws with respect to computational resources, data volume, and model size in various data domains. They find power laws in all the domains studied, and the optimal model size shows a universal trend regardless of the domain.
Found that highly expressive networks learn similar representations in multiple layers

DO WIDE AND DEEP NETWORKS LEARN THE SAME THINGS? UNCOVERING HOW NEURAL NETWORK REPRESENTATIONS VARY WITH WIDTH AND DEPTH https://arxiv.org/abs/2010.15327
They find that very deep or wide networks learn similar representations across many layers (a "block structure") corresponding to the principal component of those layers' representations. These layers can be pruned with minimal impact on accuracy. In addition, wide models tended to be stronger at scene discrimination and deep models stronger at consumer-goods discrimination.
Poor generalization due to multiple model parameter sets with the same predictive performance

Underspecification Presents Challenges for Credibility in Modern Machine Learning https://arxiv.org/abs/2011.03395
They show that the performance degradation of ML models deployed in the real world is related to underspecification: there are multiple solutions (combinations of model parameters) with the same predictive performance. Underspecification appears in all fields, including NLP, medical imaging, and computer vision, and models should be tested with this in mind.
– – – – – – – – – – – – – – – – – – – – – – – – –
10.Other research
– – – – – – – – – – – – – – – – – – – – – – – – –
The amount of computation required is growing rapidly, and research is showing that we need to do something about it. As exemplified by Big Transfer, GPT-3, and ViT, models have tended to grow in size in recent years, and we may need to do something to "democratize AI".
Dealing with differences in the distribution of training data and test data

Adversarial Validation Approach to Concept Drift Problem in Automated Machine Learning Systems https://arxiv.org/abs/2004.03045
For tasks where the train and test data distributions differ, they respond with adversarial feature selection: train a classifier to discriminate between train and test data and remove the classifier's most important features until its score becomes random, thereby matching the distributions. This works better than validating on data close to the test set or weighting the training data according to the test distribution.
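A small scikit-learn sketch of the general procedure (my own illustration; the classifier, threshold, and column handling are placeholders): label train rows 0 and test rows 1, fit a classifier to tell them apart, and iteratively drop the most discriminative feature until the AUC falls to roughly chance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_feature_drop(train_df, test_df, auc_target=0.55, max_drops=20):
    """Drop features until train and test are (nearly) indistinguishable."""
    features = list(train_df.columns)
    for _ in range(max_drops):
        X = pd.concat([train_df[features], test_df[features]], axis=0)
        y = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]   # 0 = train, 1 = test
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        auc = cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
        if auc <= auc_target or len(features) == 1:   # distributions look the same -> stop
            break
        clf.fit(X, y)
        worst = features[int(np.argmax(clf.feature_importances_))]  # most distribution-shifted feature
        features.remove(worst)
    return features
```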
Game Balancing with AlphaZero

Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess https://arxiv.org/abs/2009.04374
This is an attempt to assess game balance using AlphaZero. By training AlphaZero on chess variants with minor rule changes, such as removing castling, we can see how experienced players might evaluate those variants. In the history of chess, the rules were repeatedly modified before settling on the current ones; they suggest AlphaZero could be used to simulate this kind of process for other games as well.
The computational power required is getting bigger and bigger

The Computational Limits of Deep Learning https://arxiv.org/abs/2007.05558
Deep learning has improved performance on many tasks by using vast amounts of computation, but the paper suggests progress may stall, depending on hardware development, because the required computation keeps growing. It also suggests that the financial and environmental burdens are becoming prohibitive, so drastic efficiency improvements may be necessary.
Making PCA a decentralizable algorithm

EigenGame: PCA as a Nash Equilibrium https://arxiv.org/abs/2010.00554
They interpret PCA as a game in which each eigenvector maximizes its own utility function, and show that the Nash equilibrium is equivalent to PCA. This formulation can be implemented as a decentralized algorithm, which lets them perform large-scale analyses of neural networks. It is an important result because an autoencoder neither recovers the principal components exactly nor gives a disentangled solution.
– – – – – – – – – – – – – – – – – – – – – – – – –
11.Real world applications
– – – – – – – – – – – – – – – – – – – – – – – – –
In terms of real-world applications, there are cases in the legal and construction fields where simple tasks are delegated to machine learning. I personally think this kind of collaboration between machine learning and humans is a very realistic way to use it: it reduces the development cost of machine learning models by having them solve simple problems, and it improves productivity by letting humans focus on more complex tasks.
An A.I. Outperformed 20 Top Lawyers
The Pandemic is Replacing Lawyers with Robots Faster Than Ever – Best of Legal Tech
The AI outperforms top lawyers at finding flaws in confidentiality agreements and can review one in 26 seconds, compared to 92 minutes for a typical lawyer. The response from lawyers has been positive, citing benefits such as letting them focus on more complex work.
Planting trees with the help of AI
Creating new tree shade with the power of AI and aerial imagery
The heat island effect is a public health concern, and planting trees in cities can mitigate it. Google's Tree Canopy Lab uses aerial imagery and machine learning to create a map of tree-cover density across a city, eliminating the need for manual tree surveys. The city, which covers more than 503 square miles, has a short-term goal of planting and maintaining 90,000 trees by 2021 and then planting 20,000 trees per year, and Tree Canopy Lab is already helping people in the city work toward it.
Using Machine Learning to Protect Animals from Poaching
How ZSL protects wildlife using ML acoustic classifier | Google Cloud Blog
Google and the international conservation charity ZSL have built a machine learning model that identifies gunshots. Acoustic sensors can detect gunshots from up to 1 km away, helping wildlife conservationists in their work against poaching.
Using AI to manage the detailed progress of a construction site
AI that scans a construction site can spot when things are falling behind
Buildots, a UK-Israeli startup, monitors the status of about 150,000 components (on a three- to four-level scale, such as "installed") using 360° cameras mounted on hard hats. It is already being used on small construction sites and is expected to let human managers focus on more important work rather than monotonous tasks such as checking progress.
Using AI to analyze information warfare
The AI Company Helping the Pentagon Assess Disinfo Campaigns
The article discusses how natural language processing is used to analyze large volumes of news and other disseminated information in order to respond to information warfare. For example, it analyzes how, for months before the recent conflict between Armenia and Azerbaijan, information was disseminated to deliberately portray one side as the aggressor.
Simulation of taxation with reinforcement learning
The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies
A study that simulates tax systems using reinforcement learning. The economy is developed by alternately training an RL agent that improves the tax system in a Catan-like board-game environment and worker agents that act optimally under that tax system. The economy under the tax system obtained through reinforcement learning ultimately grows larger than under the existing tax system.
– – – – – – – – – – – – – – – – – – – – – – – – –
12. Technical articles
– – – – – – – – – – – – – – – – – – – – – – – – –
I recommend stateof.ai’s summary document because it is very readable and covers a wide range of topics, including research trends, where talent is located, ethical issues, expansion into military applications, and predictions for next year’s trends. On the research side, the report says that huge datasets and huge models drive accuracy.
AI development in 2020
A report on the development of artificial intelligence in 2020: an extensive 177-page document. It covers a wide range of topics, including research trends, where talent is located, ethical issues, expansion into military applications, and predictions of next year's trends. On the research side: huge datasets and massive models are driving accuracy, more papers are using AI in biology, PyTorch is catching up to TensorFlow, and so on.
Tool helps clear biases from computer vision
A tool for identifying biases in image datasets is now available. It uses existing image annotations and measures statistics such as object counts, object-person co-occurrence, and the country of origin of images. For example, for people and musical instruments, men tended to be shown playing the instruments while women tended to appear in the same scene without playing them.
List of important papers by field of reinforcement learning
A list of important papers on reinforcement learning by field, compiled by OpenAI. About 3–10 important papers are listed in each field, such as Model Free, Model-based, Meta-RL, etc.
Data Scientist Interview Questions and Answers
120+ Data Scientist Interview Questions and Answers You Should Know in 2021
Over 120 expected questions and answers for the data scientist interview. The content is designed to test your basic knowledge of statistics/machine learning and its use in practice. Recommended for testing whether you have mastered the basics of statistics/machine learning.
Object Detection from 9 FPS to 650 FPS in 6 Steps
An article about speeding up object detection inference from 9 FPS to 650 FPS. The key points are avoiding CPU/GPU transfers, letting the GPU do the heavy computation, batching, using half precision, and so on. The article is very convincing because it justifies each adaptation by inspecting CPU/GPU usage in Nsight Systems at every step.
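Two of those points, batching and half precision while keeping data on the GPU, can be sketched in a few lines of PyTorch (my own illustration with a placeholder classification model, not the article's detector or code).

```python
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder model; the article's detector and input sizes will differ.
model = torchvision.models.resnet50().eval().to(device)
if device == "cuda":
    model = model.half()                          # half precision on the GPU

frames = torch.rand(64, 3, 224, 224)              # stand-in for decoded video frames

with torch.no_grad():
    batch = frames.to(device, non_blocking=True)  # move to the GPU once, not per frame
    if device == "cuda":
        batch = batch.half()
    for chunk in batch.split(16):                 # batched inference instead of frame-by-frame
        out = model(chunk)
print(out.shape)
```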
Careful explanation of machine learning model interpretability
The article explains why interpretability is important, and concepts such as model transparency, with careful explanations and diagrams starting from the definitions. For example, regarding transparency, it discusses factors such as whether a human could reason through the same steps as the model and whether each step is interpretable.
Why do methods of decision tree systems often outperform neural networks?
When and Why Tree-Based Models (Often) Outperform Neural Networks
This blog suggests that neural networks fit models probabilistically while decision-tree methods fit deterministically. Neural networks are therefore strong on data that cannot be represented as 0 or 1, such as images, or that is ambiguous with many exceptions, such as natural language, while decision-tree methods are strong on many events because they can be handled as yes/no decisions.
– – – – – – – – – – – – – – – – – – – – – – – – –
– Past Articles
November 2020 summary October 2020 summary September 2020 summary