Health, Human Performance and Recreation Faculty Publications and Presentations

Classification of Twitter Vaping Discourse Using BERTweet: Comparative Deep Learning Study

William Baker, University of Arkansas, Fayetteville
Jason B. Colditz, University of Pittsburgh
Page D. Dobbs, University of Arkansas, Fayetteville
Huy Mai, University of Arkansas, Fayetteville
Shyam Visweswaran, University of Pittsburgh
Justin Zhan, University of Arkansas, Fayetteville
Brian A. Primack, Oregon State University

Document Type

Article

Publication Date

7-21-2022

Keywords

Vaping, social media, deep learning, transformer models, infoveillance

Abstract

Background: Twitter provides a valuable platform for the surveillance and monitoring of public health topics; however, manually categorizing large quantities of Twitter data is labor intensive and presents barriers to identifying major trends and sentiments. Additionally, although machine learning and deep learning approaches with high accuracy have been proposed, they require large, annotated data sets. Public pretrained deep learning classification models, such as BERTweet, produce higher-quality models while using smaller annotated training sets.

Objective: This study aims to derive and evaluate a pretrained deep learning model based on BERTweet that can identify tweets relevant to vaping, tweets (related to vaping) of commercial nature, and tweets with provape sentiment. Additionally, the performance of the BERTweet classifier will be compared against a long short-term memory (LSTM) model to show the improvements a pretrained model has over traditional deep learning approaches.

Methods: Twitter data were collected from August to October 2019 using vaping-related search terms. From this set, a random subsample of 2401 English tweets was manually annotated for relevance (vaping related or not), commercial nature (commercial or not), and sentiment (positive, negative, or neutral). Using the annotated data, 3 separate classifiers were built using BERTweet with the default parameters defined by the Simple Transformers application programming interface (API). Each model was trained for 20 iterations and evaluated with a random split of the annotated tweets, reserving 10% (n=165) of tweets for evaluation.
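The workflow described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `split_annotated` and `train_bertweet`, the seed, the toy rows, the `vinai/bertweet-base` checkpoint, and the `use_cuda=False` and `overwrite_output_dir` settings are all assumptions made for the example. The training function also treats the 20 iterations as training epochs, which is an interpretation, and it requires the `simpletransformers` and `pandas` packages, so it is defined here but not called.

```python
import random


def split_annotated(rows, eval_fraction=0.10, seed=0):
    """Shuffle annotated rows and reserve a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    # The first n_eval shuffled rows are held out; the rest are for training.
    return shuffled[n_eval:], shuffled[:n_eval]


def train_bertweet(train_rows, eval_rows, num_labels, epochs=20):
    """Fine-tune BERTweet on (text, label) rows.

    Assumption: simpletransformers and pandas are installed, and the
    checkpoint name below is the one intended; the source does not say.
    """
    import pandas as pd
    from simpletransformers.classification import ClassificationModel

    columns = ["text", "labels"]
    model = ClassificationModel(
        "bertweet",
        "vinai/bertweet-base",  # assumed checkpoint
        num_labels=num_labels,
        args={"num_train_epochs": epochs, "overwrite_output_dir": True},
        use_cuda=False,  # assumption: CPU-only so the sketch runs anywhere
    )
    model.train_model(pd.DataFrame(train_rows, columns=columns))
    result, model_outputs, wrong = model.eval_model(
        pd.DataFrame(eval_rows, columns=columns)
    )
    return model, result


# Toy stand-in for the annotated tweets (hypothetical placeholder data).
toy_rows = [(f"placeholder tweet {i}", i % 2) for i in range(20)]
train_rows, eval_rows = split_annotated(toy_rows)
```

Under this sketch, one such model would be trained per task: two labels for relevance, two for commercial nature, and three for sentiment.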

Results: The relevance, commercial, and sentiment classifiers achieved an area under the receiver operating characteristic curve (AUROC) of 94.5%, 99.3%, and 81.7%, respectively. Additionally, the weighted F1 scores of each were 97.6%, 99.0%, and 86.1%, respectively. We found that BERTweet outperformed the LSTM model in the classification of all categories.
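For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute them using only the Python standard library. It is illustrative and makes assumptions: the function names are hypothetical, the AUROC shown covers the binary case only (the abstract does not state how AUROC was averaged for the 3-class sentiment task), and the labels and scores are invented toy values, not study data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


def binary_auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented toy labels and scores, for illustration only.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.1]
toy_f1 = weighted_f1(y_true, y_pred)
toy_auroc = binary_auroc(y_true, scores)
```

Because the weighted F1 score is dominated by the majority class, it can sit well above or below the AUROC on an imbalanced task, which is one reason both metrics are typically reported together.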

Conclusions: Large, open-source deep learning classifiers, such as BERTweet, can provide researchers the ability to reliably determine if tweets are relevant to vaping; include commercial content; and include positive, negative, or neutral content about vaping with a higher accuracy than traditional natural language processing deep learning models. Such enhancement to the utilization of Twitter data can allow for faster exploration and dissemination of time-sensitive data than traditional methodologies (eg, surveys, polling research).

Comments

This article was published with support from the Open Access Publishing Fund administered through the University of Arkansas Libraries.

Citation

Baker, W., Colditz, J. B., Dobbs, P. D., Mai, H., Visweswaran, S., Zhan, J., & Primack, B. A. (2022). Classification of Twitter Vaping Discourse Using BERTweet: Comparative Deep Learning Study. JMIR Medical Informatics, 10 (7), e33678. https://doi.org/10.2196/33678

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Included in

Interprofessional Education Commons, Quality Improvement Commons, Social Media Commons, Telemedicine Commons
