suchayarj/WhatsOnYourFace...BeHonest

What's on Your Face!?...Be Honest

MOTIVATION

My goal is to build a predictive model that identifies fake/deceptive reviews on the Sephora website using a One-Class Classification Support Vector Machine.

My curiosity about the beauty industry has gone beyond finding out what kinds of chemicals are used in cosmetics. (See my past project https://github.com/suchayarj/WhatsOnYourFace)

Online marketplaces have become the dominant platform for people to buy and trade, and fake online reviews are practically unavoidable. A few years back, there was a fake review scandal involving the Sunday Riley company, which forced its employees to write fake reviews on Sephora and gave them a fairly thorough walkthrough on how to avoid being caught. I found the problem challenging and was inspired to develop a model that can detect patterns in fake reviews.

WORKFLOW

Collect data > Load into Pandas DataFrame > EDA > Natural Language Processing > One Class Classification SVM Model

DATASET

  • Makeup dataset
    • Training set: I created my own dataset by web scraping the skincare category on www.makeupalley.com.
      • The dataset consists of approximately 230,000 entries, with columns such as Review Text, Rating, Product Name, Brand Name, User Name, and User Skin Types.
      • I use only the review text and rating to train the model.
      • I treat reviews from this website as authentic, since its sole purpose is for the beauty community to share honest reviews, while Sephora exists to sell products.
    • Testing set: an existing Sephora reviews dataset from GitHub, focusing on the skincare category.
      • Reviews from Sephora are more diverse: they include both authentic and fake reviews.
  • Dataset used to evaluate the model
    • Yelp Review Dataset from https://github.com/zzhang83/Yelp_Sentiment_Analysis. The reviews are labeled as True or Deceptive. I need this to tune my model: one-class SVM is unsupervised, so I need labeled data to confirm that the model works before deploying it.

EDA/Text Analysis

  • Clean dataset (MakeupAlley & Sephora), filter review columns, prep them for text analysis using NLTK

  • Ratings:

    Sephora has an average rating of approximately 4.3, while Makeup Alley has a lower average of around 3.7. 70% of the ratings on Sephora are 5 stars, which could be an indicator of fake reviews. In contrast, Makeup Alley's rating distribution looks more balanced and authentic: 4 stars account for around 40%, while 2, 3, and 5 stars each make up about the same share.

  • Reviews:

    • Review length distributions/stats

      Reviews on Makeup Alley tend to be longer than those on Sephora. The median Makeup Alley (MA) review is roughly 400-500 words, while Sephora's median is roughly 300 words.

    • 50 most frequently used words in the reviews

      The 50 most used words in Makeup Alley and Sephora reviews are fairly similar. However, the Makeup Alley words include some negative/contrasting words, while the Sephora reviews contain more positive adjectives (see words circled in red).
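The three comparisons above (rating distribution, review length, most frequent words) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual notebook code: the toy `reviews` list, the tiny `STOP_WORDS` set, and the helper names are all assumptions made for the example, and the real analysis uses NLTK's stop words on the scraped data.

```python
import re
from collections import Counter
from statistics import median

# Hypothetical toy reviews standing in for the scraped data: (text, rating)
reviews = [
    ("I love this cream, it is amazing and perfect!", 5),
    ("Nice texture but it broke me out after a week.", 2),
    ("Great great product, love it!", 5),
    ("Okay moisturizer, however the scent is too strong.", 3),
]

# Small illustrative stop-word list (assumption; the project relies on NLTK)
STOP_WORDS = {"i", "it", "is", "and", "the", "this", "but", "me", "a",
              "after", "too", "out"}


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rating_distribution(rows):
    """Return {rating: share of reviews with that rating}."""
    counts = Counter(rating for _, rating in rows)
    total = sum(counts.values())
    return {rating: n / total for rating, n in sorted(counts.items())}


def median_length(rows):
    """Return the median review length in words."""
    return median(len(tokenize(text)) for text, _ in rows)


def top_words(rows, n=50):
    """Return the n most frequent non-stop-words with their counts."""
    counts = Counter(
        tok for text, _ in rows for tok in tokenize(text)
        if tok not in STOP_WORDS
    )
    return counts.most_common(n)


print(rating_distribution(reviews))
print(median_length(reviews))
print(top_words(reviews, n=5))
```

Running the same three helpers on each site's reviews separately gives the side-by-side comparison described in this section.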

MODEL

  • Since Sephora has both honest and fake reviews, and since my ability to obtain examples of fake reviews is limited, I decided to use a One-Class Classification Support Vector Machine, which is appropriate for unsupervised outlier detection.
  • The one-class SVM is given points from only one class during training and is expected to learn a separation between members of that class and anything else.
  • Once we feed new data to the model, any review that does not belong to the authentic group is classified as 'fake'.
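A minimal sketch of this idea with scikit-learn's `OneClassSVM`: fit on reviews assumed to be authentic, then label new reviews. The TF-IDF vectorization step and the toy review lists are assumptions for illustration (the README does not state which text features the project uses); the `nu`, `kernel`, and `gamma` values are the ones reported for the Yelp tuning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-ins for the Makeup Alley (training) and Sephora (new) reviews
authentic_reviews = [
    "Nice texture but it broke me out after a week of use.",
    "Works okay for dry skin, however the scent is too strong for me.",
    "Decent moisturizer, although the price is high for the amount you get.",
    "I used it for a month and saw a small improvement in redness.",
    "Absorbs quickly but leaves a slightly sticky feeling on my face.",
]
new_reviews = [
    "Amazing amazing product, best ever, love love love!",
    "Felt fine on my skin but did nothing for my dark spots.",
]

# Turn text into numeric features (assumption: TF-IDF with English stop words)
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(authentic_reviews)

# Train on the single "authentic" class only
model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
model.fit(X_train)

# predict() returns +1 for inliers (authentic-like) and -1 for outliers
predictions = model.predict(vectorizer.transform(new_reviews))
labels = ["authentic" if p == 1 else "fake" for p in predictions]

for review, label in zip(new_reviews, labels):
    print(f"{label}: {review}")
```

With only a handful of toy reviews the labels themselves are not meaningful; the point is the workflow of fitting on one class and reading `-1` as 'fake'.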

DEVELOP MODEL USING YELP DATASET

  • Tune parameters using the Yelp review dataset, since it is labeled and lets me evaluate the model's performance.
  • Results of the Yelp model using true reviews only, deceptive reviews only, and mixed Yelp reviews, respectively. The parameters I used for OneClassSVM are nu = 0.1, kernel = 'rbf' (Gaussian), and gamma = 0.1.
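The evaluation described above can be sketched as follows: train on the "true" class only, then score true-only, deceptive-only, and mixed sets against their labels. The synthetic 2-D points below are an assumption standing in for vectorized Yelp reviews, so the numbers they produce say nothing about the real results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for vectorized reviews (assumption, not the Yelp data)
true_train = rng.normal(0.0, 1.0, size=(200, 2))
true_test = rng.normal(0.0, 1.0, size=(100, 2))
deceptive_test = rng.normal(6.0, 1.0, size=(100, 2))

# Same parameters as reported for the Yelp model
model = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1).fit(true_train)


def inlier_rate(X):
    """Share of rows the model predicts as +1 (the 'true' class)."""
    return float(np.mean(model.predict(X) == 1))


# Mixed set with known labels: +1 = true, -1 = deceptive
X_mixed = np.vstack([true_test, deceptive_test])
y_mixed = np.array([1] * len(true_test) + [-1] * len(deceptive_test))

# Rows are actual classes, columns are predicted classes, in the order [+1, -1]
cm = confusion_matrix(y_mixed, model.predict(X_mixed), labels=[1, -1])

print("true only, predicted true:     ", inlier_rate(true_test))
print("deceptive only, predicted true:", inlier_rate(deceptive_test))
print("mixed confusion matrix:\n", cm)
```

A well-tuned model should show a high inlier rate on the true-only set and a low one on the deceptive-only set; the confusion matrix on the mixed set summarizes both at once.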

APPLY MAKEUP DATASET TO THE MODEL

  • Since One Class Classification SVM is unsupervised and the Sephora reviews are unlabeled, a confusion matrix for the Sephora predictions is not available. However, nu = 0.1 bounds the share of training (authentic) reviews treated as outliers at roughly 10%, so I expect at most about 10% of authentic reviews to be wrongly flagged as fake. **Please note that nu can be adjusted per business use case.
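To see how nu relates to the share of training points treated as outliers, the sketch below fits the model with a few nu values and measures that share directly. The random training data is an assumption used only to make the example runnable; in scikit-learn, nu is documented as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, and the measured share is typically close to nu rather than exactly equal to it.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic stand-in for the vectorized training reviews (assumption)
X_train = rng.normal(size=(500, 2))


def training_outlier_share(nu):
    """Fit with the given nu and return the share of training rows predicted -1."""
    model = OneClassSVM(nu=nu, kernel="rbf", gamma=0.1).fit(X_train)
    return float(np.mean(model.predict(X_train) == -1))


shares = {nu: training_outlier_share(nu) for nu in (0.05, 0.1, 0.3)}

for nu, share in shares.items():
    print(f"nu={nu}: {share:.1%} of training rows flagged as outliers")
```

Raising nu makes the model stricter (more reviews flagged), lowering it makes the model more permissive, which is the knob to adjust per use case.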

CONCLUSION

  • The apparent differences between authentic and fake reviews are:
    • Length of the reviews: Fake reviews tend to be shorter than authentic reviews
    • Type of words used: Fake reviews contained more positives like

FUTURE WORK

To improve the model I will...

  • Incorporate user information such as age and number of reviews into the model to improve accuracy
  • Expand stop words list (words such as skin, product, dry)
  • Adjust word filtering process
    • Stop lowercasing all the words, since reviews written in all caps tend to be less authentic.
    • Keep exclamation points, since reviews with repeated exclamation points are less likely to be authentic.
  • Test the model on the brand involved in the scandal
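One possible way to keep the capitalization and exclamation-point signals mentioned above is to compute them from the raw text before any lowercasing or punctuation stripping. The sketch below is only an illustration of that idea: `style_features` and its output keys are hypothetical names, not part of the current project.

```python
import re


def style_features(text):
    """Compute simple style signals from the raw, unprocessed review text."""
    words = re.findall(r"[A-Za-z']+", text)
    # Count words written fully in capitals, ignoring single letters like "I"
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "caps_ratio": len(caps) / len(words) if words else 0.0,
        "exclamations": text.count("!"),
        "has_repeated_exclamations": bool(re.search(r"!{2,}", text)),
    }


print(style_features("BEST CREAM EVER!!!"))
print(style_features("I like it, but the scent is strong."))
```

These values could then be appended to the text features as extra columns before training.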

See project utilizing neural network https://github.com/suchayarj/WhatsOnYourFace...FakeOrFact
