Author: Lucas de Souza Silva
With research on neural networks and their applications growing rapidly, a question arises: how can Vision Transformer networks be used in agronomy? To answer it, this work develops a neural network for the semantic segmentation of weeds in soybean crops, based on the Vision Transformer (ViT), a model built on a self-attention mechanism. The DeepWeeds dataset was used. The ViT model was compared with the following networks: Segmenter, CvT, ResNet-50 v2, DeepLabv3+, MobileNet, and Swin Transformer. The developed model consists of a ViT-Base backbone with 12 layers and 86 million parameters, combined with components of a ResNet-50 architecture used for feature extraction; the resulting segmentation model has 16 layers and 120 million parameters. It achieved a pixel-wise accuracy of 93.89% and a mean Intersection over Union (mIoU) of 0.626, results close to those of BEiT, a state-of-the-art network for this problem.
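As a reading aid for the two reported metrics, the following is a minimal sketch of how pixel-wise accuracy and mIoU are commonly computed from a confusion matrix. It is not the evaluation code used in this work: the function names, the three-class toy labels, and the choice to skip classes absent from both masks are assumptions made for illustration only.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels per (true class, predicted class) pair.

    Both inputs are flattened masks of integer class ids (hypothetical data).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("masks must contain the same number of pixels")
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm: List[List[int]]) -> float:
    """Fraction of pixels whose predicted class equals the true class."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def mean_iou(cm: List[List[int]]) -> float:
    """Average of per-class IoU = TP / (TP + FP + FN).

    Assumption: classes that appear in neither mask are skipped.
    """
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                  # true class c, predicted otherwise
        fp = sum(row[c] for row in cm) - tp   # predicted c, true class otherwise
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three hypothetical classes
# (0 = soil, 1 = soybean, 2 = weed).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
print(f"pixel accuracy = {acc:.4f}, mIoU = {miou:.4f}")
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mIoU is 0.5 while the pixel accuracy is about 0.667. The gap shows why a high pixel accuracy can coexist with a noticeably lower mIoU: accuracy is dominated by the most frequent classes, whereas mIoU weights every class equally.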