Semantic Analysis to Detect Duplicate Questions

Client Overview

  • One of the largest forums on the internet, known for its general question-and-answer platform
  • 1 bn+ questions available on the portal, 100+ categories, 1k user-defined tags
  • Provides a platform to raise questions in multiple categories not limited to computer technologies

Business Challenge

  • Labeling duplicate questions manually is a time-consuming and repetitive task, but it is equally important that the labels represent a reasonable consensus
  • Predict which pairs of questions have the same meaning
  • Label a question as a duplicate and link it to the reference question
  • Deploy the model online and run it on a batch of questions overnight, every day

VOLANSYS Contribution

VOLANSYS created a machine learning model capable of detecting duplicate questions on the forum. We deployed multiple models in production and monitored all of them every day with a classification evaluation matrix to choose the best-performing model.
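
The daily comparison of deployed models could look roughly like the sketch below. The case study does not say which metric was used, so F1 on the "duplicate" class is an assumption here, and the names `confusion`, `f1_score` and `best_model` are hypothetical helpers written for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary duplicate / not-duplicate classifier."""
    tp: int
    fp: int
    fn: int
    tn: int


def confusion(y_true, y_pred):
    """Build the confusion counts from two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return Confusion(tp, fp, fn, tn)


def f1_score(c):
    """F1 of the positive (duplicate) class; 0.0 when undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_model(y_true, predictions):
    """Return (best model name, score per model).

    `predictions` maps a model name to its list of 0/1 predictions
    for the same labelled batch (assumed selection rule: highest F1).
    """
    scores = {name: f1_score(confusion(y_true, preds))
              for name, preds in predictions.items()}
    return max(scores, key=scores.get), scores
```

Calling `best_model(labels, {"model_a": preds_a, "model_b": preds_b})` on each day's labelled batch returns the name of the model to keep along with every model's score, so the scores can be logged over time.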

Data Warehouse
  • About 1 GB of data
  • 500 mn question pair IDs
  • Question text
  • Information on 100 mn users
  • 100+ categories of questions
  • 1k tags provided by users

Visualize the keywords and tags users use most frequently, and categorize the most-used keywords by question category
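
The per-category keyword counts behind such a visualization can be sketched as follows. This is a minimal illustration using only the Python standard library; the stopword list, the input shape (pairs of category and question text) and the function names are assumptions, not details from the project.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "what", "how", "do", "i", "to", "of"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric words that are not stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]


def top_keywords_by_category(questions, k=3):
    """Count keywords per category.

    `questions` is an iterable of (category, question_text) pairs.
    Returns {category: [(keyword, count), ...]} with at most k entries each.
    """
    counts = defaultdict(Counter)
    for category, text in questions:
        counts[category].update(tokenize(text))
    return {category: counter.most_common(k)
            for category, counter in counts.items()}
```

The resulting dictionary can be fed directly to a bar chart or word cloud per category.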


Analyze the language used in different questions and the meaning of the words used to represent the categories and tags provided by users


Compare the text, words and sentiment of questions using word2vec and TF-IDF, and analyze language using the NLTK framework
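
As a rough illustration of the TF-IDF part of this comparison, the sketch below scores two questions by the cosine similarity of their TF-IDF vectors. It is written in plain Python rather than with NLTK or word2vec, the smoothed IDF formula is one common variant chosen here as an assumption, and any cut-off used to call a pair "duplicate" would have to be tuned on labelled data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms present in every document keep a nonzero weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two questions that share most of their distinctive words score close to 1, while questions with no words in common score 0. Word-level overlap misses paraphrases, which is where embedding-based methods such as word2vec are typically brought in.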

Actionable insights
  • Mark question as duplicate and add reference id
  • Report the percentage of duplicate questions and improve the platform so that users can select similar questions without posting a new one

Analytics Result
  • Current affairs (30%) and finance (15%) were the categories in which duplicate questions were asked most often
  • Identified about 1k top keywords that represent similar questions in the top 10 categories

Displays previously asked questions whenever a user starts typing a question, based on keywords already used by other users
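
One simple way to produce such suggestions is a keyword index over existing questions, sketched below. This is an illustrative assumption rather than the deployed design: `QuestionSuggester` is a hypothetical class, it matches whole keywords only (not partially typed words), and it ranks by the number of shared keywords.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class QuestionSuggester:
    """Suggest existing questions that share keywords with what the user typed."""

    def __init__(self, questions):
        self.questions = list(questions)
        # Inverted index: keyword -> ids of the questions containing it.
        self.index = defaultdict(set)
        for qid, text in enumerate(self.questions):
            for word in tokenize(text):
                self.index[word].add(qid)

    def suggest(self, partial, limit=3):
        """Return up to `limit` questions, ordered by shared keywords."""
        hits = Counter()
        for word in tokenize(partial):
            for qid in self.index.get(word, ()):
                hits[qid] += 1
        # Most shared keywords first; ties broken by original order.
        ranked = sorted(hits, key=lambda qid: (-hits[qid], qid))
        return [self.questions[qid] for qid in ranked[:limit]]
```

The index is built once from the stored questions, so each lookup only touches the questions that share at least one keyword with the typed text.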

Benefits Delivered

  • Automatically predict duplicate questions to simplify search and comment management
  • Visualizations available to categorize duplicate questions and important keywords