BERT for identifying disasters from tweets

Jennifer Bochenek, Joseph Larson, Shibo Yao, and Yifan Yang

Think of the difference between real and non-real disasters as ‘house fire on Main st, family of three dead’ compared to ‘its nice to have my family gathered round the fire’, or ‘Gas explosion causes thousands in damage, no injuries’ compared to ‘Don’t miss out on the Bitcoin explosion!’ As humans, we can easily read the sentiment and intentionality behind each of the statements, and recognize which is an actual disaster and which is not, but computers are less skilled at this.

While we created many new variables in the course of examining this dataset (a topic for another Medium article someday), the BERT model we chose used only the cleaned tweet text. Wordclouds are provided below to show the cleaned tweet text for both real and non-real disasters.

Image 1. Text of real disasters (training set only). Prominent words include new, fire, via, will, one, death, flood, storm, suicide bomber, and hiroshima.

Both wordclouds (top and bottom) are made using the same seed, so words of similar frequency appear in the same position, size, and color, which eases comparison between the two. The most common word in the top image is ‘new’, and the same is true of the bottom image. The results diverge after that. These differences are what any machine learning or deep learning model needs to pick out and identify.

Image 2. Non-real disasters tweet text
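To make the comparison between the two wordclouds concrete, here is a minimal sketch that counts words per class and lists the ones both classes share. The tiny samples are adapted from the example tweets earlier in this article and are illustrative only; they are not the real dataset, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter
from typing import Iterable

def word_counts(tweets: Iterable[str]) -> Counter:
    """Count lowercase, whitespace-separated words across a set of tweets."""
    counts: Counter = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

# Tiny illustrative samples adapted from the example tweets, not the real dataset.
real = ["house fire on main st", "gas explosion causes thousands in damage"]
not_real = ["family gathered round the fire", "bitcoin explosion"]

real_counts, other_counts = word_counts(real), word_counts(not_real)
shared = sorted(set(real_counts) & set(other_counts))
only_real = sorted(set(real_counts) - set(other_counts))
print(shared)     # ['explosion', 'fire']
print(only_real)
```

Words such as "fire" and "explosion" appear in both classes, which is why word frequency alone is unlikely to separate the two and a context-aware model is attractive.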

The model has been updated since we first used it, but we continue to use the version at the link to match our initial results. The model has 24 hidden layers, a hidden size of 1,024, and 16 attention heads. It was pre-trained for English on Wikipedia and BooksCorpus. The input type is uncased, meaning that the text has to be converted to lowercase before tokenization and any accents have to be stripped in order to use this model properly. If your input has a mix of upper and lower case, or if accent marks are important, there are other pre-trained BERT models you can use.
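As a minimal sketch of what "uncased" preprocessing means, the function below lowercases text and strips accent marks using only the Python standard library. The function name `normalize_uncased` is hypothetical, and many BERT tokenizers can perform this step themselves when configured for lowercase input, so treat this as an illustration rather than a required step.

```python
import unicodedata

def normalize_uncased(text: str) -> str:
    """Lowercase text and strip accents, as an uncased model expects."""
    text = text.lower()
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"), keeping the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(normalize_uncased("Café FIRE on Main St"))  # cafe fire on main st
```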

One of the steps is to build our helper functions:
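As a rough illustration of what such a helper usually does (this is not our original code), the sketch below turns a tweet into the three fixed-length inputs a BERT model typically expects: token ids, an attention mask, and segment ids. The name `encode_tweet`, the toy vocabulary, and the whitespace tokenizer are all assumptions made for the example; a real pipeline would use the WordPiece tokenizer and vocabulary that ship with the pre-trained model.

```python
from typing import Dict, List, Tuple

# Toy vocabulary for illustration only; a real model ships its own WordPiece vocab.
TOY_VOCAB: Dict[str, int] = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
                             "house": 4, "fire": 5, "on": 6, "main": 7, "st": 8}

def encode_tweet(text: str, max_len: int = 8) -> Tuple[List[int], List[int], List[int]]:
    """Return (token_ids, attention_mask, segment_ids), each of length max_len."""
    # Whitespace splitting stands in for WordPiece tokenization (an assumption).
    tokens = text.lower().split()
    # Reserve two positions for the [CLS] and [SEP] markers, truncating if needed.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    token_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    pad_len = max_len - len(token_ids)
    attention_mask = [1] * len(token_ids) + [0] * pad_len  # 1 = real token, 0 = padding
    token_ids = token_ids + [TOY_VOCAB["[PAD]"]] * pad_len
    segment_ids = [0] * max_len  # single-sentence input, so every position is segment 0
    return token_ids, attention_mask, segment_ids

ids, mask, segments = encode_tweet("House fire on Main st")
print(ids)       # [2, 4, 5, 6, 7, 8, 3, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```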

Now we need to set up our layers:

And finally build the model:

See the model summary below. This is a fairly simple BERT setup.

Image 3. Model summary

The total training time on a Colab TPU was 3 hours. We determined that 3 epochs was appropriate because training for more than 3 epochs overfit the model to the training set: performance on the validation set decreased.
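The reasoning behind the epoch choice can be sketched as picking the epoch with the best validation score. The example below assumes validation loss is the metric being tracked, and the loss values are made up purely for illustration; they are not our results.

```python
from typing import List

def best_epoch(val_losses: List[float]) -> int:
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# Made-up validation losses, for illustration only (not the article's results).
illustrative_val_losses = [0.48, 0.42, 0.40, 0.44, 0.51]
print(best_epoch(illustrative_val_losses))  # 3
```

When the validation loss starts rising again after an epoch, as in the last two made-up values, further training is fitting the training set at the expense of generalization.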

Image 4. Training performance of BERT

Our final model performance was as follows:

The normalized confusion matrix is below.
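For readers unfamiliar with the term, a normalized confusion matrix rescales raw counts into proportions. The sketch below normalizes each row so that it sums to 1, which assumes rows represent true classes; other normalizations (by column or by the overall total) are also common. The counts are made up for illustration and are not our results.

```python
from typing import List

def normalize_rows(matrix: List[List[int]]) -> List[List[float]]:
    """Divide each row by its total so each row sums to 1 (rows = true classes)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against an empty class to avoid dividing by zero.
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized

# Made-up counts: rows are true labels, columns are predicted labels.
illustrative_counts = [[80, 20],   # true non-disaster
                       [30, 70]]   # true disaster
print(normalize_rows(illustrative_counts))  # [[0.8, 0.2], [0.3, 0.7]]
```

Read this way, the diagonal entries give the share of each true class that was predicted correctly, which makes it easy to see whether a model handles one class better than the other.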

While the model is better at predicting non-disasters than real disasters, it performed better than the other models we tried on this same dataset. In total we used Multinomial Naïve Bayes, Support Vector Machines, K-Nearest Neighbors, Gradient Boosting, and BERT. Of these, BERT performed best at predicting whether a tweet described a real or non-real disaster.

Table 1. Results of our ML methods; BERT performed best
