Fixing Bad
Introduction and Issue/Inspiration
In the midst of battling a global health pandemic, we are also fighting an “infodemic”, i.e. the spread of incorrect information about the coronavirus. For fact-check queries specific to COVID-19, current tools are either rule-based, requiring the user to read through a large webpage, or they produce answers that are unrelated to the search query. In both cases, the search results confuse the average user, who cannot easily understand the returned information, which in turn fuels the spread of misinformation about the disease. With our submission, Fixing Bad, we take a step towards creating a fact-checking tool for COVID-19 that uses machine learning and natural language processing, along with Google Cloud APIs, to answer user queries reliably and concisely.
What it does
Our goal was to build a simple website that the average user can operate and that directly validates information about COVID-19 without producing confusing results. The web app consists of a front page that accepts a COVID-19 related query from the user. The query is then processed by our pipeline, which either confirms or denies the claim. In borderline cases, when our results are inconclusive, we return a short list of trusted webpages where the user can find more information.
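To make this flow concrete, here is a minimal sketch of how the front end could hand a query to the backend; the route name, the fact_check() helper, and its return values are illustrative assumptions rather than our exact code.

```python
# Minimal sketch of the query flow: the user submits a claim, the backend
# returns a verdict, and borderline cases fall back to trusted links.
from flask import Flask, request, jsonify

app = Flask(__name__)

def fact_check(query: str):
    # Placeholder for the real pipeline (retrieval + embedding + similarity).
    return "inconclusive", ["https://www.who.int", "https://www.cdc.gov"]

@app.route("/check", methods=["POST"])
def check():
    query = request.form.get("query", "")
    verdict, trusted_links = fact_check(query)
    if verdict == "inconclusive":
        # Borderline case: return a short list of trusted webpages instead.
        return jsonify({"verdict": verdict, "read_more": trusted_links})
    return jsonify({"verdict": verdict})

if __name__ == "__main__":
    app.run()
```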
How did your project evolve with the support of the COVID-19 hackathon fund by Google Cloud?
With the COVID-19 hackathon fund, we hosted a public website for our project so that anybody with the link can access our fact-checking tool. Additionally, the Google Cloud credits given to us support the backend of the website, which needs compute resources for the neural natural language processing engine.
How you built it
For the first phase, we used open-source NLP libraries like spaCy and NLTK to tokenize the input query. Once we have the tokens, we normalize the sentiment of the phrase by looking at the negation tokens. We then perform text pre-processing to arrive at a standardized representation of the query, and use the recently popular deep learning model BERT to obtain an embedding of this representation. For the second phase, we tuned the Google Custom Search API to retrieve articles, and snippets from them, in a suitable preference order; the NLP pipeline had to be augmented to handle non-standard text on websites. Finally, in the third phase, we used a cosine-similarity-based metric to compare the query and article embeddings.
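A condensed sketch of the three phases is shown below. It assumes spaCy's dependency-based negation tags for the sentiment normalization, the bert-base-uncased checkpoint with mean pooling for the embedding, and the Google Custom Search JSON API for retrieval; the specific model, pooling strategy, and example sentences are illustrative assumptions rather than the exact production pipeline.

```python
# requires: pip install spacy torch transformers requests numpy
#           python -m spacy download en_core_web_sm
import numpy as np
import requests
import spacy
import torch
from transformers import AutoModel, AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def preprocess(text: str) -> str:
    # Phase 1a: tokenize with spaCy and prune negation tokens ("not", "n't"),
    # an approximate way of normalizing the sentiment of the phrase.
    doc = nlp(text.lower())
    kept = [t.text for t in doc if t.dep_ != "neg" and not t.is_punct]
    return " ".join(kept)

def embed(text: str) -> np.ndarray:
    # Phase 1b: mean-pool BERT's last hidden states into a single vector.
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def fetch_snippets(query: str, api_key: str, cx: str, n: int = 5):
    # Phase 2: query the Google Custom Search JSON API and keep the snippets.
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": cx, "q": query, "num": n},
    )
    resp.raise_for_status()
    return [item["snippet"] for item in resp.json().get("items", [])]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Phase 3: cosine similarity between query and article embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example (illustrative sentences only):
query_vec = embed(preprocess("Does drinking hot water cure COVID-19?"))
snippet_vec = embed(preprocess("There is no evidence that hot water cures COVID-19."))
print(cosine(query_vec, snippet_vec))
```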
Challenges you ran into
For the first phase, the major challenge was ensuring that the embedding robustly handles sentiment changes in the input. While this is an open problem in the NLP community, we approximate a solution by pruning the negation tokens in the query. For the second phase, we had to deal with inconsistent formatting across webpages, which led to non-robust API responses (where a slight, meaningless change in the input query produces very different responses) and to noise in the final deep representations. For the final phase, determining a threshold on the similarity metric that distinguishes a True from a False response was challenging because of the many ways similar queries can be phrased by the user.
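To make the threshold discussion concrete, a toy version of the decision rule might look like the sketch below; the two numeric cut-offs are illustrative assumptions, not tuned values from our system.

```python
# Toy decision rule: compare the best query-article similarity against two
# hand-tuned cut-offs; the middle band triggers the trusted-links fallback.
TRUE_THRESHOLD = 0.75          # assumed value for illustration
INCONCLUSIVE_THRESHOLD = 0.55  # assumed value for illustration

def decide(similarities):
    best = max(similarities, default=0.0)
    if best >= TRUE_THRESHOLD:
        return "true"
    if best >= INCONCLUSIVE_THRESHOLD:
        return "inconclusive"  # borderline: return trusted webpages instead
    return "false"
```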
Accomplishments you are proud of
Since most fact-checking websites are either rule-based or only look at the query at a high level, we found that understanding the low-level details of the query leads to a large improvement in the quality of results. To verify this, we compared against Google’s own Fact Check API and found that we are able to provide results to the end user that are far less confusing and far more concise. In a relatively short time, we were able to use multiple cloud APIs (Google Custom Search and Fact Check), state-of-the-art natural language processing tools (BERT, spaCy), and integrate all the components into an end-to-end web application using Flask.
What you learned
Sleep is for the weak! We also learned how to successfully use unfamiliar software and interfaces such as Flask, BERT, and the Google Custom Search API.
What’s next for your project
In the future, we plan to make our web app more robust to small changes in the input query that cause large changes in the sentiment of what the user is asking. We also plan to use a machine learning model that learns the similarity threshold for distinguishing between a “true” and a “false” response.
What Google Cloud products did you use to build your project?
- GCP Compute Engine
- Google Custom Search API
Ambar Pal
Johns Hopkins University
Parmi Thakker
Johns Hopkins University