Data Analysis Movie – Obtaining Insights on Van Orders

Hi guys! I’ve created a short “movie” on analyzing van orders.

Watch my video here!

Hope you guys enjoy it and I’d appreciate your feedback 🙂 Below is a brief description of the movie and the data I used –

Part 1 – Use SQL to obtain insights

Vanorders – contains information on all orders, including

  • idvanOrder: order ID, the primary key
  • requestor_client_id: client account ID
  • servicer_auth: driver ID
  • total_price: price of the order
  • order_status: whether the order is completed (=2), cancelled (=3) or expired (=5)

Vaninterest – an entry is created whenever a driver picks up an order. If that driver then rejects the order, another driver can pick it up, in which case Vaninterest will show 2 entries for the same order.

  • idvaninterest: the primary key
  • idvanOrder: order ID
  • servicer_auth: driver ID
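
To make the two tables concrete, here is a minimal sketch that rebuilds them in an in-memory SQLite database and runs two simple queries: a count of orders per status, and a list of orders that were picked up by more than one driver. The rows are made up for illustration, the column types are my guess, and I am assuming that servicer_auth in Vanorders holds the driver who finally took the order. The queries in the video may look different.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Schema follows the column names described above; the types are assumptions.
conn.executescript("""
CREATE TABLE vanorders (
    idvanOrder          INTEGER PRIMARY KEY,
    requestor_client_id INTEGER,
    servicer_auth       INTEGER,
    total_price         REAL,
    order_status        INTEGER  -- 2 = completed, 3 = cancelled, 5 = expired
);
CREATE TABLE vaninterest (
    idvaninterest INTEGER PRIMARY KEY,
    idvanOrder    INTEGER,
    servicer_auth INTEGER
);
""")

# Hypothetical sample rows, not real data.
conn.executemany("INSERT INTO vanorders VALUES (?, ?, ?, ?, ?)", [
    (1, 100, 7, 120.0, 2),
    (2, 100, 8, 80.0, 2),
    (3, 101, None, 60.0, 3),
    (4, 102, None, 90.0, 5),
])
conn.executemany("INSERT INTO vaninterest VALUES (?, ?, ?)", [
    (1, 1, 7),
    (2, 2, 9),  # driver 9 picked up order 2, then rejected it
    (3, 2, 8),  # driver 8 picked it up afterwards
    (4, 3, 9),
])

# Number of orders in each status.
status_counts = conn.execute("""
    SELECT order_status, COUNT(*)
    FROM vanorders
    GROUP BY order_status
    ORDER BY order_status
""").fetchall()

# Orders with more than one Vaninterest entry, i.e. picked up more than once.
repicked = conn.execute("""
    SELECT o.idvanOrder, COUNT(i.idvaninterest) AS pickups
    FROM vanorders AS o
    JOIN vaninterest AS i ON i.idvanOrder = o.idvanOrder
    GROUP BY o.idvanOrder
    HAVING COUNT(i.idvaninterest) > 1
    ORDER BY o.idvanOrder
""").fetchall()

print(status_counts)
print(repicked)
```

With the sample rows above, only order 2 shows up in the second query, because it is the only order with two Vaninterest entries.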

Part 2 – Assess whether a system change in the order allocation system improved order match time

A change to the order allocation system was introduced at midnight on 30 March 2017. Its objective was to improve order match time, and I would like to analyze whether it achieved that objective.


  1. Orders from 24 March – 3 April 2017
  2. Time those orders were placed
  3. Time those orders were accepted by drivers

The difference between 2 and 3 is the order match time.
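
As a rough illustration of the comparison, the sketch below computes the match time for a handful of made-up orders and compares the median before and after the change. Several things here are my assumptions rather than details of the actual analysis: the timestamps are hypothetical, "midnight on 30 March" is taken to mean 00:00 at the start of that day, orders are grouped by the time they were placed, and the median is used as the summary statistic.

```python
from datetime import datetime
from statistics import median

# Assumed cut-over moment: 00:00 at the start of 30 March 2017.
CHANGE_TIME = datetime(2017, 3, 30, 0, 0)

# Hypothetical (placed, accepted) timestamps, not real data.
orders = [
    (datetime(2017, 3, 25, 9, 0), datetime(2017, 3, 25, 9, 4)),
    (datetime(2017, 3, 28, 18, 30), datetime(2017, 3, 28, 18, 32)),
    (datetime(2017, 3, 29, 23, 59), datetime(2017, 3, 30, 0, 2)),
    (datetime(2017, 3, 30, 8, 0), datetime(2017, 3, 30, 8, 1, 30)),
    (datetime(2017, 4, 2, 12, 0), datetime(2017, 4, 2, 12, 1)),
]


def match_seconds(placed, accepted):
    """Order match time: seconds between placement and driver acceptance."""
    return (accepted - placed).total_seconds()


# Split by placement time relative to the system change.
before = [match_seconds(p, a) for p, a in orders if p < CHANGE_TIME]
after = [match_seconds(p, a) for p, a in orders if p >= CHANGE_TIME]

median_before = median(before)
median_after = median(after)

print(f"median match time before: {median_before:.0f} s")
print(f"median match time after:  {median_after:.0f} s")
```

A lower median after the change would be consistent with an improvement, although on real data it would be worth checking whether the difference could simply be due to day-of-week or time-of-day effects.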


Predicting the Survival of Titanic Passengers (Part 2)

In my previous blog post, we learned a bit about what affects the survival of Titanic passengers by conducting exploratory data analysis and visualizing the data. The data was then wrangled to prepare it for modelling. In this blog post, I will use machine learning algorithms available in Python’s scikit-learn library to predict which passengers in the testing data survived. A Decision Tree Classifier is used as an example, and its hyperparameters are then tuned to see whether that improves prediction accuracy. I’ll also try using an ensemble of models to predict the results.

Learning more about ICOs

In this blog post, I would like to introduce the project we did at the 5th Unhackathon organized by Data Science Hong Kong. Our team’s project was to look at Initial Coin Offering (ICO) data extracted from ICObench to determine which ICOs are scams. Given that we only had a few hours to work on it at the Unhackathon, we focused on data wrangling and visualization to learn more about ICOs. After the Unhackathon, I spent some time conducting a simple analysis of a few additional features. As the data was insufficient to determine which ICOs are scams, I played around with it to see if there are any patterns in what makes an ICO profitable.

To summarize, I found that –

  • There are a lot of outliers in terms of the return on investment. It is hard to predict which ICOs will be profitable, at least based on the data we got.
  • ICObench provides a rating for each ICO, which is calculated based on ICObench’s algorithm and ratings from “Experts” (certain groups of ICO users). These 2 metrics have rather different rating standards, and it seems that more weight is given to ICObench’s algorithm in determining the overall rating.
  • Based on our data, returns on investment do not seem to be strongly related to ratings. This relationship is also affected by the huge outliers.

Predicting the Survival of Titanic Passengers (Part 1)

This is a classic project for those who are starting out in machine learning: the aim is to predict which passengers survived the Titanic shipwreck. I will give this project a try using the training and testing data obtained from Kaggle.
