pdf : arxiv.org/abs/1804.09915.pdf
citation : F. Piewak, P. Pinggera, M. Schäfer, D. Peter, B. Schwarz, N. Schneider, M. Enzweiler, D. Pfeiffer, and M. Zöllner, “Boosting LiDAR-based semantic labeling by cross-modal training data generation,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11134 LNCS, 2019, pp. 497–513.
summary :
words:
viable : feasible, able to be carried out
cylindrical : shaped like a cylinder
abundance : a plentiful amount
adequate : suitable, sufficient
Contents:
Intro
1. Autonomous vehicles carry many sensors / these sensors are used to acquire as much data as possible and fuse it / one such method is the dynamic occupancy grid map [6] / this serves object tracking, situation analysis, and path planning [7][8] / to master such demanding tasks, the vehicle must not only distinguish obstacles from free space but also obtain a deeper semantic understanding of its surroundings / within computer vision, advances in deep learning have driven progress in semantic image labeling [9] / however, detailed semantic information of similar quality has to be extracted independently from each sensor modality (for performance, availability, and safety) / therefore this paper proposes LiLaNet, an efficient deep neural network architecture for point-wise, multi-class semantic labeling of LiDAR data.
/ In the camera domain this is well established, and large amounts of training data play a crucial role / large-scale generic datasets include ImageNet [10], COCO [11], KITTI [12], and Cityscapes [1] / in contrast, the LiDAR domain offers only indoor datasets [13]-[15], or outdoor datasets [16] captured at high resolution by stationary sensors / therefore [17]-[19] resorted to the KITTI dataset for an indirect extraction of the desired LiDAR point semantics, using its annotated object bounding boxes and the camera-based road detection benchmark / this indirect extraction yields usable training datasets, but it is relatively cumbersome and limited to a small set of semantic classes / hence the Autolabeling process proposed here / Autolabeling automatically generates large amounts of semantically annotated LiDAR data by directly transferring high-quality semantic information from a registered reference camera image / the semantic information in the reference image is obtained with an off-the-shelf neural network / the dataset obtained this way significantly boosts LiDAR-based semantic labeling performance / the contributions are: / 1. an efficient CNN architecture for high-quality semantic labeling of LiDAR point cloud data / 2. a large-scale automated cross-modal training data generation process for boosting LiDAR-based semantic labeling performance / 3. a quantitative evaluation of the semantic labeling performance, which also assesses the automated training data generation process /
Within the fields of mobile robotics and autonomous driving, vehicles are typically equipped with multiple sensors of complementary modalities such as cameras, LiDAR and RADAR in order to generate a comprehensive and robust representation of the environment [2–5]. / This model provides the basis for high-level tasks such as object tracking [7], situation analysis and path planning [8]. In order to master these high-level tasks, it is essential for an autonomous vehicle to not only distinguish between generic obstacles and free-space, but to also obtain a deeper semantic understanding of its surroundings. Within the field of computer vision, the corresponding task of semantic image labeling has experienced a significant boost in recent years due to the resurgence of advanced deep learning techniques [9]. However, detailed semantic information of similar quality has to be extracted independently from each of the sensor modalities to maximize system performance, availability and safety. Therefore, in this paper we introduce LiLaNet, an efficient deep neural network architecture for point-wise, multi-class semantic labeling of semi-dense LiDAR data.
Related Work :
1. LiDAR-based semantic labeling has gained attention in recent years because improved mobile sensor technology provides higher resolution and longer range at reduced cost / the various proposed approaches can be distinguished by the way the point-wise 3D information is utilized /
LiDAR- based semantic labeling has gained increased attention in recent years due to the availability of improved mobile sensor technology, providing higher resolution and longer range at reduced cost. The various proposed approaches of LiDAR-based semantic labeling can be discriminated by the way the point-wise 3D information is utilized.
2. The first approaches using depth information for semantic labeling were based on RGB-D data, which complements RGB image data with an additional depth channel [20, 21] / often a stereo camera was used to create a dense depth image, which was then fused with the RGB image / Tosteberg [22] developed a technique that accumulates the depth information of a LiDAR sensor over time and projects it into the camera space / the accumulation yields a depth image of increased density without requiring dedicated upsampling algorithms /
The first approaches using depth information for semantic labeling were based on RGB-D data, which complements RGB image data with an additional depth channel [20, 21], allowing to recycle 2D semantic image labeling algorithms. Often a stereo camera was used to create a dense depth image, which was then fused with the RGB image. Tosteberg [22] developed a technique to use depth information of a LiDAR sensor accumulated over time to project it into the camera space. The accumulation yields a depth image of increased density without requiring dedicated upsampling algorithms.
3. A different category of approaches treats the 3D LiDAR data as an unordered point cloud: PointNet [23], PointNet++ [24] and PointCNN [25] / the PointNet architecture [23] combines local point features with globally extracted feature vectors to infer semantic information on a point-wise basis / extending this idea, PointNet++ [24] introduces a hierarchical PointNet architecture that generates an additional mid-level feature representation for an improved handling of point neighborhood relations / both work well on indoor scenes but reach their limits in large-scale outdoor scenarios / PointCNN [25] is also based on unordered point clouds, but introduces modified convolution layers extended by permutations and weighting of the input features / this transfers the advantages of traditional CNNs to unordered point cloud processing / however, the approach has only been used for object detection and not yet for semantic labeling of point clouds /
A different category of approaches considers the 3D LiDAR data as an unordered point cloud, including PointNet [23], PointNet++ [24] and PointCNN [25]. The PointNet architecture [23] combines local point features with globally extracted feature vectors, allowing for the inference of semantic information on a point-wise basis. Extending this idea, PointNet++ [24] introduces a hierarchical PointNet architecture to generate an additional mid-level feature representation for an improved handling of point neighborhood relations. Both approaches are evaluated successfully on indoor scenes but reach their limits in large scale outdoor scenarios. The PointCNN [25] approach is based on unordered point clouds as well, but introduces modified convolution layers extended by permutations and weighting of the input features. This allows to transfer the advantages of traditional CNNs to unordered point cloud processing. However, the approach is only used for object detection and has not yet been applied to semantic labeling of point clouds.
4. Yet another way of representing the LiDAR input data is used by the SEGCloud [26] and OctNet [27] methods: a voxel (SEGCloud) or octree (OctNet) representation is created and the convolution layers are extended to 3D convolutions / these approaches retain the original 3D structure of the input and preserve spatial relations well / however, the algorithms have to cope with the high sparsity of the data, and inference time as well as memory requirements increase drastically for large-scale outdoor scenes /
Yet another way of representing LiDAR input data is within cartesian 3D space, which is used in the SEGCloud [26] and OctNet [27] methods. Here a Voxel (SEGCloud) or an OctTree (OctNet) representation is created and the convolution layers are extended to 3D convolutions. These approaches retain the original 3D structure of the input points, making them more powerful in preserving spatial relations. However, the algorithms have to cope with the high sparsity of the data, and inference time, as well as memory requirements, increase drastically for large-scale outdoor scenes.
5. To avoid the complexity of 3D convolutions, 2D views of the input data are considered / based on such 2D views, state-of-the-art image-based deep learning algorithms can be applied / depending on the use case, different viewpoints or virtual cameras may be used / Caltagirone et al. [19] use a top-view image of the LiDAR point cloud to label road points in a street environment / this top-view projection of the LiDAR points is reasonable for road detection, but mutual point occlusions make a general semantic labeling task difficult / alternatively, the 2D view is often obtained via a cylindrical projection of the LiDAR points, which suits rotating LiDAR scanners; in this case the sensor view provides a dense depth image, which is advantageous for subsequent processing steps / Wu et al. [17] use this type of input image for the SqueezeSeg architecture, which performs a SqueezeNet-based [28] semantic labeling to segment cars, pedestrians and cyclists, and the bounding boxes of the KITTI object detection dataset [12] are used to transfer the point-wise semantic information needed for training and evaluation / Dewan et al. [18] use the cylindrical projection of the LiDAR data as input to a CNN based on the Fast-Net architecture [29] to distinguish movable from non-movable points; similar to Wu et al. [17], the KITTI object detection dataset is used to transfer the ground-truth bounding box labels to the enclosed points /
A possible solution to avoid the computational complexity of 3D convolutions is the rendering of 2D views of the input data. Based on such 2D views, state-of-the-art image-based deep learning algorithms can be applied. Depending on the use case, different viewpoints or virtual cameras may be used. Caltagirone et al. [19] use a top-view image of a LiDAR point cloud for labeling road points within a street environment. This top-view projection of the LiDAR points is a valid choice for road detection, but the resulting mutual point occlusions generate difficulties for more general semantic labeling tasks as in our case. An alternative is to place the virtual camera origin directly within the sensor itself. The resulting 2D view is often visualized via a cylindrical projection of the LiDAR points (see Fig. 2), which is particularly suitable for the regular measurement layout of common rotating LiDAR scanners. In this case, the sensor view provides a dense depth image, which is highly advantageous for subsequent processing steps. Wu et al. [17] use this type of input image for the SqueezeSeg architecture, which performs a SqueezeNet-based [28] semantic labeling to segment cars, pedestrians and cyclists. The bounding boxes of the KITTI object detection dataset [12] are used to transfer the point-wise semantic information required for training and evaluation. The approach of Dewan et al. [18] uses the cylindrical projection of the LiDAR data as an input for a CNN based on the Fast-Net architecture [29] to distinguish between movable and non-movable points. Similar to Wu et al. [17], the KITTI object detection dataset is used for transferring the ground-truth bounding box labels to the enclosed points.
6. Varga et al. [30] propose another method that generates a semantically labeled point cloud at runtime from a combined setup of fisheye cameras and LiDAR sensors / first, pixel-wise semantics are extracted from the camera images with a CNN trained on Cityscapes [1] / then the LiDAR points are projected into the images to transfer the semantic information from pixels to 3D points / however, no semantic information is inferred on the LiDAR point cloud itself, and the spatial and temporal registration of the sensor modalities remains a challenge (I don't fully understand this part) / this paper takes the idea of [30] one step further and uses a joint camera/LiDAR setup to obtain large amounts of 3D semantic training data / the data is then used to train a deep neural network that infers point-wise semantic information directly from the LiDAR data / combining the large amount of automatically generated data with a small manually annotated dataset boosts the overall semantic labeling performance significantly /
Varga et al. [30] propose an alternative method to generate a semantically labeled point cloud at runtime, based on a combined setup of fisheye cameras and LiDAR sensors. First, pixel-wise semantics are extracted from the camera images via a CNN model trained on Cityscapes [1]. Subsequently, the LiDAR points are projected into the images to transfer the semantic information from pixels to 3D points. However, no semantic information is inferred on the LiDAR point cloud itself, and spatial and temporal registration of the sensor modalities remains a challenge. In the present paper we take the idea of [30] one step further and utilize a joint camera/LiDAR sensor setup to generate large amounts of 3D semantic training data. The data is then used to train a deep neural network to infer point-wise semantic information directly on LiDAR data. We show that combining the large amounts of automatically generated data with a small manually annotated dataset boosts the overall semantic labeling performance significantly.
Method :
1. LiLaNet, a CNN architecture, performs point-wise, multi-class semantic labeling of LiDAR data / the goal is to transfer lessons learned from image-based semantic labeling to the LiDAR domain / the cylindrical projection of a 360° scan from a rotating LiDAR scanner is fed to the network as input / training is boosted by an efficient automated cross-modal data generation process called Autolabeling /
We introduce a novel CNN architecture called LiLaNet for the point-wise, multi-class semantic labeling of LiDAR data. To obtain high output quality and retain efficiency at the same time, we aim to transfer lessons learned from image-based semantic labeling methods to the LiDAR domain. The cylindrical projection of a 360° point cloud captured with a state-of-the-art rotating LiDAR scanner is used as input to our networks. Training is boosted by an efficient automated cross-modal data generation process, which we refer to as Autolabeling.
LiDAR Images :
1. The cylindrical point cloud projection turns the scan into 2D depth and reflectivity images that are dense and free from mutual point occlusions / this enables the use of optimized 2D convolution layers (standard CNN building blocks) / inference time is drastically reduced compared to voxel grid or octree representations / and the cylindrical image can be transformed back into the full three-dimensional point cloud without loss of information (a projection sketch follows the quote below) /
LiDAR sensors measure the time of flight of emitted laser pulses in order to determine the distance of surrounding objects with high accuracy. In addition, modern sensors even provide coarse reflectivity estimates on a point-wise basis. For the following experiments we consider a Velodyne VLP32C LiDAR scanner, which features 32 vertically stacked send/receive modules rotating around a common vertical axis. While rotating, each module periodically measures the distance and reflectivity at its current orientation, i.e. at the respective azimuth and elevation angles. We combine the measurements of a full 360° scan to create cylindrical depth and reflectivity images, as illustrated in Fig. 2. This projection represents the view of a virtual 360° cylindrical camera placed at the sensor origin. At ten revolutions per second, images of size 1800×32 pixels are obtained.

The cylindrical point cloud projection provides dense depth and reflectivity images which are free from mutual point occlusions. This allows for the application of optimized 2D convolution layers, as used with great success in state-of-the-art image-based CNN architectures. In this way, inference time is reduced drastically compared to the use of full 3D input representations such as voxel grids or octtrees. Further, since measurement times and orientation angles are known with high accuracy, it is straightforward to transform the cylindrical image back into a full three-dimensional point cloud representation without any loss of information.

In cases where no laser reflection is received by the sensor, for example when pointed towards the sky, pixels corresponding to the respective measurement angles are marked as invalid in the resulting depth image.
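To make the projection concrete, here is a minimal NumPy sketch of building the cylindrical depth and reflectivity images from one 360° scan. The 1800×32 image size follows the text, but the vertical field-of-view limits, the invalid-pixel convention and the nearest-return handling are assumptions; a production pipeline would map each point to its laser ring index and exact measurement angle rather than binning the elevation angle.

```python
import numpy as np

def cylindrical_projection(points, reflectivity, width=1800, height=32,
                           fov_up_deg=15.0, fov_down_deg=-25.0):
    """Build dense cylindrical depth/reflectivity images from one 360 deg scan.

    Sketch only: the assumed vertical field of view and the binning scheme
    stand in for the sensor's real ring layout."""
    depth = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(points[:, 1], points[:, 0])              # [-pi, pi]
    elevation = np.arcsin(points[:, 2] / np.maximum(depth, 1e-6))

    fov_up, fov_down = np.deg2rad(fov_up_deg), np.deg2rad(fov_down_deg)

    # image column from azimuth, image row from elevation
    u = ((azimuth + np.pi) / (2 * np.pi) * width).astype(int) % width
    v = np.clip(((fov_up - elevation) / (fov_up - fov_down) * height).astype(int),
                0, height - 1)

    depth_img = np.full((height, width), -1.0)   # -1 marks invalid pixels
    refl_img = np.full((height, width), -1.0)

    # write far-to-near so the closest return per pixel wins
    order = np.argsort(-depth)
    depth_img[v[order], u[order]] = depth[order]
    refl_img[v[order], u[order]] = reflectivity[order]
    return depth_img, refl_img
```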
Class Mapping : skip
LiLaNet Network Architecture :
1. LiDAR-based semantic labeling is performed with a CNN architecture / to cope with the low resolution and the extreme asymmetry in the aspect ratio of the LiDAR images, the network is built from LiLaBlocks, which are inspired by the GoogLeNet inception modules [31] (a block sketch follows below) /
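Since this note only describes the LiLaBlock at a high level, the following PyTorch block is a hedged illustration of the underlying idea: an Inception-style module with asymmetric kernels suited to the wide, flat LiDAR image. The specific kernel sizes, channel counts and activations here are assumptions, not necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LiLaBlockSketch(nn.Module):
    """Inception-style block with asymmetric kernels (illustrative only)."""
    def __init__(self, in_channels: int, n: int):
        super().__init__()
        # Three parallel branches: tall, wide and square receptive fields,
        # intended to cope with the extreme 1800 x 32 aspect ratio.
        self.branch_tall = nn.Sequential(
            nn.Conv2d(in_channels, n, kernel_size=(7, 3), padding=(3, 1)),
            nn.ReLU(inplace=True))
        self.branch_wide = nn.Sequential(
            nn.Conv2d(in_channels, n, kernel_size=(3, 7), padding=(1, 3)),
            nn.ReLU(inplace=True))
        self.branch_square = nn.Sequential(
            nn.Conv2d(in_channels, n, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))
        # 1x1 convolution fuses the concatenated branches back to n channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * n, n, kernel_size=1),
            nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.branch_tall(x), self.branch_wide(x),
                         self.branch_square(x)], dim=1)
        return self.fuse(out)

# Example: depth + reflectivity input (2 channels) on a 32 x 1800 LiDAR image.
block = LiLaBlockSketch(in_channels=2, n=32)
y = block(torch.randn(1, 2, 32, 1800))   # -> (1, 32, 32, 1800)
```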
Autolabeling :
1. Point-wise annotation of LiDAR data takes far more effort and entails higher cost than annotating 2D images / the reasons are the additional spatial dimension and the sparsity of the data / it is non-intuitive and cumbersome for human annotators / for these reasons, an efficient automated process for large-scale training data generation called Autolabeling is proposed /
2. First, a pixel-wise semantic labeling of the reference camera image is computed.
In the first step, a high-quality pixel-wise semantic labeling of the reference camera image is computed via state-of-the-art deep neural networks, as can be found on the leaderboard of the Cityscapes benchmark [1].
3. Next, the point cloud is projected into the reference image plane and the semantic information of the image pixels is transferred to the corresponding LiDAR points / a single mono camera covers only a fraction of the full point cloud, so coverage can be increased with multiple cameras / the fully automated procedure yields semantically labeled point clouds that are used to train LiLaNet / details are given below (see the projection sketch after the quote) /
Second, the captured point cloud is projected into the reference image plane to transfer the semantic information of the image pixels to the corresponding LiDAR points. While a single reference camera will in general only cover a fraction of the full point cloud, it is straightforward to extend the approach to multiple cameras for increased coverage.
The described fully automated procedure yields semantically labeled point clouds which can directly be used to train LiDAR-based semantic labeling networks such as LiLaNet. In the following subsections we describe the various stages of the data generation process in more detail.
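As an illustration of the transfer step, the sketch below projects LiDAR points into the reference image with assumed calibration inputs (the names T_cam_from_lidar and K are hypothetical) and copies the per-pixel class ids onto the visible points; points outside the camera frustum stay unlabeled.

```python
import numpy as np

def transfer_labels(points_lidar, label_image, T_cam_from_lidar, K):
    """Copy pixel-wise semantic labels onto the LiDAR points they project to.

    Minimal sketch: `T_cam_from_lidar` (4x4 extrinsics) and `K` (3x3
    intrinsics) are assumed to come from an offline camera/LiDAR calibration;
    `label_image` is the H x W class-id image produced by the image CNN.
    Points outside the camera frustum keep the label -1 (unlabeled)."""
    n = points_lidar.shape[0]
    labels = np.full(n, -1, dtype=np.int32)

    # Transform the points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera, then apply the pinhole model.
    in_front = pts_cam[:, 2] > 0.1
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]

    h, w = label_image.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[visible]
    labels[idx] = label_image[v[visible], u[visible]]
    return labels
```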
Semantic Image Labeling
4. Following [34], the pixel-wise semantic labeling of the camera image is performed (the model is apparently used as-is); the network was trained on the Cityscapes dataset and achieves an Intersection-over-Union (IoU) test score of 72.6% with respect to the original Cityscapes label set / the Autolabeling process works with any image-based reference network of sufficient output quality (an illustrative inference sketch follows) /
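The reference network is interchangeable; as a stand-in for the Cityscapes-trained network of [34], the snippet below runs an off-the-shelf torchvision DeepLabV3 model (trained on a COCO/VOC label set, not Cityscapes) just to show how the pixel-wise label image is produced. The file name is hypothetical.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# Any sufficiently good off-the-shelf segmentation network can serve as the
# image-side reference; DeepLabV3 is used here purely for illustration.
model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("reference_camera_frame.png").convert("RGB")  # hypothetical path
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))["out"]        # (1, C, H, W)
label_image = logits.argmax(dim=1).squeeze(0).numpy()            # H x W class ids
```

The resulting `label_image` is exactly what the projection step above consumes when copying labels onto the LiDAR points.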
Point Projection
5. Projecting the 3D points of the scanning LiDAR into the reference camera image plane requires several aspects to be considered / as the LiDAR scanner rotates through 360°, each point is measured at a different time / in contrast, the camera image is captured at a single instant / to resolve this, a point-wise ego-motion correction using vehicle odometry is applied (in the paper's figure, the right side shows the corrected result, the left the uncorrected one); a correction sketch follows below /
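A hedged sketch of the per-point ego-motion correction: each point is transformed from the vehicle pose at its own measurement time into the pose at the camera trigger time. The pose_at helper (odometry interpolation) is an assumed interface, not the paper's implementation.

```python
import numpy as np

def egomotion_correct(points, timestamps, t_image, pose_at):
    """Point-wise ego-motion correction (sketch).

    Each LiDAR point is measured at its own time while the scanner rotates,
    so every point is moved into a common reference time `t_image` (the
    camera trigger time). `pose_at(t)` is an assumed helper returning the
    4x4 world-from-vehicle pose at time t, e.g. interpolated from odometry."""
    T_ref_from_world = np.linalg.inv(pose_at(t_image))

    corrected = np.empty_like(points)
    for i, (p, t) in enumerate(zip(points, timestamps)):
        # vehicle pose at the moment this individual point was measured
        T_world_from_meas = pose_at(t)
        # move the point: measurement frame -> world -> reference frame
        p_h = np.append(p, 1.0)
        corrected[i] = (T_ref_from_world @ T_world_from_meas @ p_h)[:3]
    return corrected
```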
Dataset :
For the remaining content, refer to the paper.