Most existing point cloud object detection methods cannot fully adapt to the characteristics of point clouds (such as sparsity), so some key semantic information (such as object shape) is not well captured. This paper proposes a hierarchical graph network (HGNet) based on graph convolution (GConv), which takes a point cloud directly as input and predicts 3D bounding boxes. Shape-attention graph convolution (SA-GConv) describes the shape of an object through the relative geometric positions of its points. A U-shaped network based on SA-GConv obtains multi-level features, an improved voting module uses them to generate candidates, and a candidate reasoning module based on graph convolution then considers the global scene semantics to predict the bounding boxes. On two large-scale point cloud datasets, the framework outperforms the previous state-of-the-art models.

Paper background

Because point clouds are sparse, methods designed for grid data (such as CNNs) do not perform well on them. To solve this problem, several methods designed for point cloud data have recently been proposed, such as projection-based methods, volumetric convolution methods, and PointNet-based methods. The first two attempt to convert point cloud data strictly into grid-structured data, while the last aggregates features without explicitly considering the geometric positions of the points.

Compared with other methods, PointNet++ retains the sparse character of the points, so it is widely used as the backbone of detection frameworks. However, some challenges remain unsolved. First, because the relative geometric positions of the points are not considered, a PointNet++ backbone ignores some local shape information. Second, the structure of such frameworks does not make full use of multi-level semantics, which may discard information that helps object detection.

This paper proposes a hierarchical graph network (HGNet) based on graph convolution (GConv) for 3D object detection on point clouds. HGNet consists of three parts: a U-shaped network based on graph convolution (GU-net), a candidate generator, and a candidate reasoning module (ProRe module).

The entire HGNet is trained in an end-to-end manner. In this framework, the local shape information, multi-level semantics, and global scene information (candidate features) of the point cloud are captured, aggregated, and merged by the hierarchical graph model, which takes the characteristics of point cloud data into account.

The main contributions of this article are as follows:

(A) A new hierarchical graph network (HGNet) is developed for 3D object detection on point clouds, which performs better than existing methods.

(B) A novel SA-(De)GConv is proposed, which can effectively aggregate features and capture the shape information of objects in the point cloud.

(C) A new GU-net is constructed to generate multi-level features, which is essential for 3D object detection.

(D) Utilizing global information, the ProRe module improves detection performance by reasoning about candidates.

Paper model

Fusion sampling

3D object detection has two families of frameworks: point-based and voxel-based. The former is more time-consuming and consists of two stages: candidate generation and prediction refinement.

In the first stage, set abstraction (SA) layers down-sample the points to improve efficiency and enlarge the receptive field, and feature propagation (FP) layers propagate features back to the points dropped during down-sampling. In the second stage, a refinement module refines the output of the region proposal network (RPN) to obtain more accurate predictions. SA is necessary for extracting point features, but the FP layers and the refinement module limit efficiency.

Shape attention graph convolution

A point cloud usually does not express the shape of an object explicitly, but the relative geometric positions of a point's neighbors can describe the local shape around that point. This paper introduces a novel shape-attention graph convolution, which captures object shape by modeling the geometric positions of points.

For a point set X of n points, each point consists of its geometric position p_i and a D-dimensional feature f_i. The goal is to generate a new point set X' of n' points. The paper designs a graph convolution that aggregates features from X to X'. Similar to the sampling layer of PointNet++, it first samples n' points from the n points. K nearest neighbors (KNN) is then typically used to gather the local neighborhood of each sampled point, and the features aggregated from that neighborhood become the feature of the center point.
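The sampling and grouping step can be sketched as follows. This is a minimal pure-Python illustration, assuming farthest point sampling (FPS) as the sampler, which the GU-net section names for its down-sampling module. The function names, the starting index, and the toy coordinates are hypothetical and not taken from the paper.

```python
import math

def farthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def knn(points, center, k):
    """Indices of the k points nearest to points[center], the center itself included."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[j], points[center]))
    return order[:k]

# Toy point set X with n = 5 points (hypothetical coordinates).
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0),
          (1.0, 1.0, 0.0), (0.0, 1.2, 0.0)]

centers = farthest_point_sampling(points, n_samples=3)       # n' = 3 center points
neighborhoods = {c: knn(points, c, k=2) for c in centers}    # local region per center
print(centers)
print(neighborhoods)
```

Each sampled center ends up with a small neighborhood of indices; a graph convolution would then aggregate the features of those neighbors into the center's new feature.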

In this aggregation, g encodes the relative position of points i and j, using a convolution to map the three-dimensional offset to a one-dimensional weight, and f is an MLP applied to the features. The two outputs are multiplied for each of the K nearest neighbors of the center point, and the maximum over those neighbors is taken as the feature of point i. The shape-attention operation differs from a plain MLP-based operation mainly because of the function g. Although the attention has no normalization such as softmax, the output of g acts like an attention weight: each neighbor's weight is multiplied by its corresponding features.
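Read literally, the description amounts to taking, for each center point i, the channel-wise maximum over its neighbors j of g(p_j - p_i) multiplied by f(f_j). The sketch below follows that reading in pure Python. Several details are assumptions rather than the paper's definition: g is a single linear map (a 1x1 convolution over a 3-vector), f is a one-layer MLP with ReLU, the offset is taken as p_j - p_i, and all weights and inputs are made-up toy values.

```python
def g(rel_pos, w):
    """Map a 3-D relative position to one scalar weight (assumed linear map)."""
    return sum(wi * ri for wi, ri in zip(w, rel_pos))

def f(feature, weight_rows):
    """Assumed one-layer MLP with ReLU; each weight row yields one output channel."""
    return [max(0.0, sum(w * x for w, x in zip(row, feature)))
            for row in weight_rows]

def sa_gconv(center_pos, neighbor_pos, neighbor_feats, w_g, w_f):
    """Shape-attention aggregation for one center point, as read from the text."""
    weighted = []
    for p_j, f_j in zip(neighbor_pos, neighbor_feats):
        rel = [a - b for a, b in zip(p_j, center_pos)]   # relative position of j to i
        weight = g(rel, w_g)                             # unnormalized attention weight
        weighted.append([weight * v for v in f(f_j, w_f)])
    # Channel-wise maximum over the neighbors gives the new center feature.
    return [max(channel) for channel in zip(*weighted)]

# Toy inputs (hypothetical): two neighbors with 2-D features, identity MLP weights.
new_feature = sa_gconv(
    center_pos=(0.0, 0.0, 0.0),
    neighbor_pos=[(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
    neighbor_feats=[[1.0, 4.0], [3.0, 1.0]],
    w_g=[1.0, 1.0, 1.0],
    w_f=[[1.0, 0.0], [0.0, 1.0]],
)
print(new_feature)  # [6.0, 4.0]
```

The second neighbor is farther away along the weighted direction, so its first channel (3.0 scaled by 2.0) wins the maximum, while the first neighbor's second channel (4.0 scaled by 1.0) still dominates. This is how the geometric offset changes which neighbor contributes to each channel.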

GU-net

The paper designs a down-sampling module and stacks it four times to form a down-sampling path, then stacks an up-sampling module twice to form an up-sampling path. Similar to FPN, GU-net generates a feature pyramid of three point feature maps. The down-sampling module samples points with farthest point sampling (FPS), builds local regions with KNN, and then updates the features with SA-GConv. The up-sampling module reverses this process and is mainly carried out by SA-DeGConv, the up-sampling counterpart of SA-GConv.
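To make the layout concrete, the sketch below only tracks how many points each stage holds; it does not compute features. The input size, the halving ratio, and the assumption that each up-sampling module restores the resolution of the matching down-sampling stage (as in a typical U-shaped network) are illustrative choices, not values stated in the text.

```python
def gu_net_layout(n_input, down_ratio=2, n_down=4, n_up=2):
    """Return point counts after each down-sampling stage and for each pyramid level."""
    sizes = [n_input]
    for _ in range(n_down):
        sizes.append(sizes[-1] // down_ratio)    # FPS keeps a fraction of the points
    down_sizes = sizes[1:]

    # The pyramid starts at the coarsest map and gains one level per up-sampling module.
    pyramid = [down_sizes[-1]]
    for step in range(1, n_up + 1):
        pyramid.append(down_sizes[-1 - step])    # assumed: matches the earlier stage
    return down_sizes, pyramid

down_sizes, pyramid = gu_net_layout(n_input=2048)
print(down_sizes)  # [1024, 512, 256, 128]
print(pyramid)     # [128, 256, 512]
```

Four down-sampling stages followed by two up-sampling stages yield exactly three feature maps, which is the pyramid the candidate generator consumes.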

Candidate generator

GU-net generates three point feature maps with multi-level semantics. Some previous methods (such as VoteNet) use only one feature map for object prediction. Even though the up-sampling process combines lower-layer features when computing higher-layer features, the features of different layers provide different semantics, so it is more beneficial to use the multi-level features together for candidate generation. The paper proposes a candidate generator that uses an improved voting module as its main structure to predict object centers. The module first converts the multi-level features into the same feature space. Next, to aggregate features, N_p votes are retained through FPS, as in VoteNet, and the multi-level features are merged to predict the bounding boxes and their categories.
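The voting idea can be sketched as follows: every point on each pyramid level predicts an offset toward its object center, the votes from all levels are pooled, and FPS keeps N_p of them. This is a simplified illustration, not the paper's improved voting module. The offsets here are hard-coded stand-ins for what a learned network would predict, and the function names and toy coordinates are hypothetical.

```python
import math

def cast_votes(seed_positions, predicted_offsets):
    """A vote is the seed position shifted by its predicted offset to the object center."""
    return [tuple(s + o for s, o in zip(seed, off))
            for seed, off in zip(seed_positions, predicted_offsets)]

def fps_keep(points, n_keep, start=0):
    """Farthest point sampling over the pooled votes; returns the retained votes."""
    chosen = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_keep:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return [points[i] for i in chosen]

# Two pyramid levels with hypothetical seeds and offsets; objects sit near x = 0 and x = 5.
level_a = cast_votes([(1.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
                     [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
level_b = cast_votes([(0.0, 1.0, 0.0)],
                     [(0.0, -0.9, 0.0)])

all_votes = level_a + level_b              # pool the votes from every level
kept = fps_keep(all_votes, n_keep=2)       # N_p = 2 retained votes
print(kept)  # [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
```

Because FPS favors votes that are far apart, the two retained votes land on different objects, while the near-duplicate vote from the second level is dropped.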

Candidate Reasoning Module

Through the above steps, the multi-level local semantic information has been well captured, but the global information has not been well learned. In addition, some objects may be represented by only a few surface points in the point cloud, and it is difficult to identify them correctly from so little information. The reasoning step operates on H_p, the candidate feature tensor, and P, the relative positions of the candidates.

Paper experiments

The paper reports experiments on two datasets, SUN RGB-D and ScanNet-V2.

Ablation experiments are also carried out to demonstrate the effectiveness of each module.

Conclusion

This paper proposes a novel HGNet framework that learns semantics through hierarchical graph modeling.

Specifically, the authors propose a novel and lightweight shape-attention graph convolution to capture local shape semantics, which aggregates features using the relative geometric positions of points. GU-net is built on SA-GConv and SA-DeGConv and generates a feature pyramid containing multi-level semantics. The points of the feature pyramid vote for the centers of their corresponding objects, and the multi-level semantics are further aggregated to generate candidates. The ProRe module then merges and propagates features between candidates, using global scene semantics to improve detection performance. Finally, the bounding boxes and categories are predicted.
