CornerNet: Detecting Objects as Paired Keypoints

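Here "corner" means a box corner, not a center: CornerNet is a one-stage object detector that predicts each object's top-left and bottom-right corners, and these two corners together define the bbox.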
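The method needs no anchor boxes. The model outputs a heatmap for top-left corners, a heatmap for bottom-right corners, and an embedding vector per corner that is used to pair corners: the two corners of the same object should have (nearly) the same embedding vector at their respective positions in this third output.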
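To make the pairing step concrete, here is a minimal sketch of how corner candidates could be grouped into boxes. The tuple layout, the function name `pair_corners`, and the scalar embedding are assumptions for illustration; the 0.5 distance threshold and the averaged score follow my reading of the paper's test-time description and should be treated as such.

```python
def pair_corners(top_lefts, bottom_rights, max_dist=0.5):
    """Pair corner candidates into boxes.

    Each corner is a hypothetical tuple (x, y, class_id, score, embedding),
    where embedding is a single float.
    """
    boxes = []
    for (x1, y1, c1, s1, e1) in top_lefts:
        for (x2, y2, c2, s2, e2) in bottom_rights:
            if c1 != c2:
                continue  # corners must belong to the same class
            if x2 <= x1 or y2 <= y1:
                continue  # bottom-right must lie below and to the right
            if abs(e1 - e2) > max_dist:
                continue  # embeddings too far apart: different objects
            # Box score: average of the two corner scores.
            boxes.append((x1, y1, x2, y2, c1, (s1 + s2) / 2.0))
    return boxes
```

With two objects whose embeddings cluster around 0.1 and 2.0, only the two matching pairs survive; the cross pairs are rejected either by the embedding distance or by the geometry check.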
The network architecture is shown in the figure:
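The paper proposes a max-pooling variant tailored to CornerNet, called corner pooling. It takes two feature maps as input and outputs the sum of the two pooled results. The rationale is that a corner is not a local feature: to locate a top-left corner the network has to look rightward and downward, and for a bottom-right corner it similarly has to look leftward and upward.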
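A minimal NumPy sketch of corner pooling on single-channel (H, W) maps may help. The function names and the choice of which input map is pooled in which direction are assumptions based on my reading of the paper; a real implementation would operate on batched multi-channel tensors.

```python
import numpy as np

def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling on two (H, W) feature maps.

    f_t is max-pooled downward: each cell takes the max over itself and
    every cell below it in the same column. f_l is max-pooled rightward:
    each cell takes the max over itself and every cell to its right.
    """
    # Reverse, running max, reverse back = suffix max along the axis.
    pooled_down = np.maximum.accumulate(f_t[::-1, :], axis=0)[::-1, :]
    pooled_right = np.maximum.accumulate(f_l[:, ::-1], axis=1)[:, ::-1]
    return pooled_down + pooled_right

def bottom_right_corner_pool(f_b, f_r):
    """Bottom-right corner pooling: look upward and leftward instead."""
    pooled_up = np.maximum.accumulate(f_b, axis=0)     # prefix max over rows
    pooled_left = np.maximum.accumulate(f_r, axis=1)   # prefix max over columns
    return pooled_up + pooled_left
```

The running maximum means a strong response anywhere along an object's top edge or left edge propagates back to the top-left corner location, which is exactly the non-local evidence the corner needs.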
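The authors argue that predicting a corner only requires looking in 2 directions, whereas predicting a center requires looking in 4, so predicting corners is easier. Moreover, representing the bboxes in a w*h image with corners has complexity $O(wh)$ (there are wh possible top-left corners and wh possible bottom-right corners, and the two counts add, because in this model each top-left corner is matched to only one bottom-right corner), whereas representing them with anchors requires $O(w^2h^2)$ (there are wh possible positions and wh possible sizes, and the counts multiply, because each position is combined with many sizes).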
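As a rough worked example of the gap between the two counts (the 128 x 128 grid is an illustrative assumption, not a number taken from these notes), plugging in $w = h = 128$ gives:

$$
\underbrace{wh + wh}_{\text{corners}} = 2 \cdot 128 \cdot 128 = 32{,}768,
\qquad
\underbrace{wh \cdot wh}_{\text{anchors}} = (128 \cdot 128)^2 = 268{,}435{,}456
$$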
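There is no multi-scale prediction. The corner predictions also come with a positional offset. The backbone is an hourglass network. The top-left heatmap has C channels, where C is the number of classes; there is no background class.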
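For each object there is exactly one positive top-left corner, so every other location is a negative. However, when a negative location lies within a certain radius of the positive one, its loss is reduced. The radius is chosen so that a bbox drawn from a corner inside it still has an IoU with the gt box above some threshold. The amount of reduction depends on the distance and follows a 2D Gaussian centered at the positive corner.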
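The penalty reduction can be sketched as a ground-truth heatmap that is 1 at the positive corner and decays as a Gaussian around it. In this sketch the radius is simply passed in (its derivation from the IoU threshold is omitted), and `sigma = radius / 3` follows my reading of the paper; the function name and the hard cutoff at the radius are assumptions.

```python
import numpy as np

def gaussian_target(height, width, cx, cy, radius):
    """Ground-truth heatmap for one positive corner at column cx, row cy.

    Values are 1 at the corner, decay as an unnormalized 2D Gaussian inside
    the radius, and are 0 (an ordinary negative) outside it.
    Assumes radius > 0.
    """
    sigma = radius / 3.0  # assumption: sigma is one third of the radius
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    heat = np.exp(-d2 / (2.0 * sigma ** 2))
    heat[d2 > radius ** 2] = 0.0  # no reduction outside the radius
    return heat
```

When several objects of the same class are present, one plausible way to combine their maps is an element-wise maximum, so that overlapping Gaussians do not sum above 1.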
The loss function is as follows:
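As I recall it from the CornerNet paper, the detection loss is a focal-loss variant in which $p_{cij}$ is the predicted score, $y_{cij}$ is the Gaussian-augmented ground truth, and $N$ is the number of objects:

$$
L_{det} = -\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}
\begin{cases}
(1-p_{cij})^{\alpha}\log(p_{cij}) & \text{if } y_{cij}=1 \\
(1-y_{cij})^{\beta}\,(p_{cij})^{\alpha}\log(1-p_{cij}) & \text{otherwise}
\end{cases}
$$

A minimal NumPy sketch under that assumption follows; the function name, the `eps` clipping, and the defaults $\alpha = 2$, $\beta = 4$ are my choices based on the paper's reported settings.

```python
import numpy as np

def corner_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Focal-loss variant over corner heatmaps of identical shape.

    pred holds probabilities in [0, 1]; target is 1 at positive corners and
    holds the Gaussian-reduced values (in [0, 1)) everywhere else.
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # keep log() finite
    pos = target == 1.0
    neg = ~pos
    pos_loss = ((1.0 - pred[pos]) ** alpha * np.log(pred[pos])).sum()
    # (1 - target)^beta shrinks the penalty for negatives near a positive.
    neg_loss = ((1.0 - target[neg]) ** beta
                * pred[neg] ** alpha
                * np.log(1.0 - pred[neg])).sum()
    num_pos = max(int(pos.sum()), 1)  # avoid division by zero
    return -(pos_loss + neg_loss) / num_pos
```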
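The offset prediction is explained as compensating for the localization error introduced by downsampling in the network; I find this explanation reasonable.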
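A small sketch shows where the error comes from and how the offset removes it. The function names and the stride of 4 used in the example are illustrative assumptions.

```python
import math

def corner_offset(x, y, stride):
    """Map an image-space corner to its heatmap cell and fractional offset."""
    fx, fy = x / stride, y / stride
    ix, iy = math.floor(fx), math.floor(fy)
    # The offset is the fraction lost when rounding down to the cell index.
    return (ix, iy), (fx - ix, fy - iy)

def recover_corner(cell, offset, stride):
    """Invert corner_offset: add the offset back before upscaling."""
    return ((cell[0] + offset[0]) * stride, (cell[1] + offset[1]) * stride)
```

For a corner at (103, 58) and a stride of 4, the cell is (25, 14) and the offset is (0.75, 0.5). Upscaling the cell alone gives (100, 56), a 3-pixel and 2-pixel error, while adding the offset first recovers (103, 58) exactly.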
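The embedding idea was not first proposed in this paper. In short, two losses are used so that the embeddings of corners belonging to the same object move closer together and those of unrelated corners move farther apart: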
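The two losses can be sketched as a "pull" term that draws each object's two corner embeddings toward their mean and a "push" term that separates the means of different objects by a margin. The function name, the scalar embeddings, and the margin `delta = 1` follow my reading of the paper and are assumptions here.

```python
import numpy as np

def pull_push_loss(e_tl, e_br, delta=1.0):
    """Pull/push embedding losses for N >= 1 objects.

    e_tl[k] and e_br[k] are the scalar embeddings of the top-left and
    bottom-right corners of object k.
    """
    e_tl = np.asarray(e_tl, dtype=float)
    e_br = np.asarray(e_br, dtype=float)
    n = len(e_tl)
    mean = (e_tl + e_br) / 2.0
    # Pull: both corners of an object move toward their shared mean.
    pull = ((e_tl - mean) ** 2 + (e_br - mean) ** 2).sum() / n
    if n < 2:
        return pull, 0.0
    # Push: means of different objects should be at least delta apart.
    gap = np.abs(mean[:, None] - mean[None, :])
    margin = np.maximum(0.0, delta - gap)
    np.fill_diagonal(margin, 0.0)  # an object is not pushed from itself
    push = margin.sum() / (n * (n - 1))
    return pull, push
```

Only ground-truth corner locations contribute to these terms, so the embeddings at all other positions are left unconstrained.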
For details on the embeddings and the hourglass network, see these three papers:
Newell, A. and Deng, J. (2017). Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems, pages 2168-2177.
Newell, A., Huang, Z., and Deng, J. (2017). Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2274-2284.
Newell, A., Yang, K., and Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483-499. Springer.