Use this identifier to cite or link to this record: http://hdl.handle.net/10437/12937
Full record
DC Field | Value | Language
dc.contributor.author | Petukhova, Alina | -
dc.contributor.author | Fachada, Nuno | -
dc.date.accessioned | 2022-06-17T09:31:01Z | -
dc.date.available | 2022-06-17T09:31:01Z | -
dc.date.issued | 2022-07-01 | -
dc.identifier.citation | Petukhova, A. & Fachada, N. (2022). TextCL: a Python package for NLP preprocessing tasks. SoftwareX, 19, 101122. | pt
dc.identifier.issn | 2352-7110 | -
dc.identifier.uri | http://hdl.handle.net/10437/12937 | -
dc.description | SoftwareX 19 (2022) 101122 | pt
dc.description.abstract | Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as, but not limited to, web scraping, scanned documents or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. Another functionality offered by the TextCL package is the outlier detection module, which makes it possible to identify and filter out texts that differ from the main topic distribution of the data set. This module lets the user select one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and apply it to the text data. Keywords: Natural language processing; Text filtering; Outlier detection | en
dc.description.sponsorship | Fundação para a Ciência e a Tecnologia, Portugal: UIDB/04111/2020 (COPELABS). | pt
dc.format | application/pdf | pt
dc.language.iso | eng | pt
dc.publisher | Elsevier | pt
dc.rights | openAccess | pt
dc.subject | INFORMÁTICA | pt
dc.subject | PROCESSAMENTO DE DADOS | pt
dc.subject | PROCESSAMENTO DE TEXTO | pt
dc.subject | LINGUAGEM NATURAL | pt
dc.subject | LINGUAGEM PYTHON | pt
dc.subject | COMPUTER SCIENCE | en
dc.subject | DATA PROCESSING | en
dc.subject | WORD PROCESSING | en
dc.subject | NATURAL LANGUAGE | en
dc.subject | PYTHON PROGRAMMING LANGUAGE | en
dc.title | TextCL: a Python package for NLP preprocessing tasks | pt
dc.type | article | pt
Appears in collections: FE - Artigos de Revistas Internacionais com Arbitragem Científica
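
As a rough illustration of two of the preprocessing steps named in the abstract (splitting texts into sentences and removing duplicate sentences), the following is a minimal standard-library sketch. It does not use or reproduce the TextCL API: the function names `split_sentences` and `drop_duplicate_sentences`, the regex-based splitting rule, and the sample text are all assumptions made for this example only.

```python
import re

# Hypothetical helpers for illustration only; these are NOT TextCL functions.


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def drop_duplicate_sentences(sentences):
    """Drop repeated sentences, keeping the first occurrence of each.

    Sentences are compared case-insensitively with whitespace collapsed.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


# Made-up sample text, not taken from the paper.
sample = "Text data is noisy. Text data is noisy. Cleaning it takes time!"
cleaned = drop_duplicate_sentences(split_sentences(sample))
print(cleaned)
```

A regex splitter of this kind will mis-handle abbreviations and decimal numbers, so it should be read only as a sketch of the idea, not as a description of how the package implements these steps.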

Files in this record:
File | Description | Size | Format
1-s2.0-S2352711022000802-main.pdf | | 444.62 kB | Adobe PDF


All records in the repository are protected by copyright law, with all rights reserved.