Data Analysis Movie – Obtaining Insights on Van Orders

Hi guys! I’ve created a short “movie” on analyzing van orders.

Watch my video here!

Hope you guys enjoy it and I’d appreciate your feedback 🙂 Below is a brief description of the movie and the data I used –

Part 1 – Use SQL to obtain insights

Vanorders – contains information on all orders, including

  • idvanOrder: order ID, the primary key
  • requestor_client_id: client account ID
  • servicer_auth: driver ID
  • total_price: price of the order
  • order_status: whether the order is completed (=2), cancelled (=3) or expired (=5)

Vaninterest – an entry is created whenever a driver picks up an order. If the driver then rejects the order, another driver can pick it up; in that case, two entries will appear in Vaninterest for the same order.

  • idvaninterest: the primary key
  • idvanOrder: order ID
  • servicer_auth: driver ID
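
To give a feel for the kind of queries these two tables support, here is a minimal sketch using Python's built-in sqlite3 module. The column names come from the description above, but the column types, the sample rows and the two queries are my own assumptions for illustration; they are not the actual data or the queries shown in the movie.

```python
import sqlite3

# In-memory database with hypothetical rows; only the column names
# are taken from the table descriptions above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Vanorders (
    idvanOrder INTEGER PRIMARY KEY,
    requestor_client_id INTEGER,
    servicer_auth INTEGER,
    total_price REAL,
    order_status INTEGER  -- 2 = completed, 3 = cancelled, 5 = expired
);
CREATE TABLE Vaninterest (
    idvaninterest INTEGER PRIMARY KEY,
    idvanOrder INTEGER,
    servicer_auth INTEGER
);
""")
conn.executemany("INSERT INTO Vanorders VALUES (?, ?, ?, ?, ?)", [
    (1, 100, 10, 120.0, 2),
    (2, 101, 11, 80.0, 2),
    (3, 100, None, 60.0, 3),
    (4, 102, None, 95.0, 5),
])
conn.executemany("INSERT INTO Vaninterest VALUES (?, ?, ?)", [
    (1, 1, 12),  # driver 12 picked up order 1, then rejected it
    (2, 1, 10),  # driver 10 picked up the same order afterwards
    (3, 2, 11),
])

# Number of orders and total price per order status.
status_summary = conn.execute("""
    SELECT order_status, COUNT(*), SUM(total_price)
    FROM Vanorders
    GROUP BY order_status
    ORDER BY order_status
""").fetchall()

# Orders that were picked up by more than one driver.
multi_pickup = conn.execute("""
    SELECT idvanOrder, COUNT(*) AS pickups
    FROM Vaninterest
    GROUP BY idvanOrder
    HAVING COUNT(*) > 1
""").fetchall()

print(status_summary)
print(multi_pickup)
```

The second query relies on the fact that each pick-up creates its own Vaninterest entry, so an order with more than one entry was passed between drivers.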

Part 2 – Assess whether a system change in the order allocation system improved order match time

A change to the order allocation system was introduced at midnight on 30 March 2017. The objective of the change was to improve order match time. I would like to analyze whether the change achieved its objective.

The analysis uses the following data:


  1. Orders from 24 March – 3 April 2017
  2. Time those orders were placed
  3. Time those orders were accepted by drivers

The difference between 2 and 3 is the order match time.
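
As a rough sketch of how the match time could be computed and compared before and after the change, here is a small Python example. The timestamps are made up, the timestamp format is an assumption, and I am assuming "midnight on 30 March" means 00:00 at the start of that day; the real analysis in the movie may differ.

```python
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format
CHANGE_TIME = datetime(2017, 3, 30, 0, 0)  # assumed: start of 30 March 2017

# Hypothetical (time placed, time accepted) pairs, not real order data.
orders = [
    ("2017-03-25 09:00:00", "2017-03-25 09:05:00"),
    ("2017-03-28 18:30:00", "2017-03-28 18:37:00"),
    ("2017-03-30 08:00:00", "2017-03-30 08:03:00"),
    ("2017-04-02 12:15:00", "2017-04-02 12:16:00"),
]

before, after = [], []
for placed, accepted in orders:
    placed_at = datetime.strptime(placed, FMT)
    accepted_at = datetime.strptime(accepted, FMT)
    # Order match time = time accepted minus time placed, in seconds.
    match_seconds = (accepted_at - placed_at).total_seconds()
    # Group each order by whether it was placed before or after the change.
    if placed_at < CHANGE_TIME:
        before.append(match_seconds)
    else:
        after.append(match_seconds)

print("Mean match time before change (s):", mean(before))
print("Mean match time after change (s):", mean(after))
```

A lower mean after the change would be consistent with an improvement, although with real data it would be worth checking the distribution and the number of orders in each group before drawing a conclusion.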


An Overview of TED Talks using Tableau

I’ve always been a huge fan of TED talks. TED talks cover a wide range of topics, from technology and innovation to society and personal growth, and have always been a huge source of inspiration for me. Therefore, I decided to use this topic to practice building dashboards in Tableau. In this visualization, I hope to understand more about TED talks by asking the following questions:

  • How have TED talks changed over the years (e.g. number of talks, topics)?
  • What are the most popular TED talks?
  • What do audiences generally think about TED talks?

The data set is from Kaggle and covers 2,550 TED talks from 1972 to 22 Sep 2017. I did some basic cleaning of the data, mostly converting the format of the ratings and themes. You can view the interactive Tableau dashboard by clicking the image.

TED Talks (1972 - 2017)
To summarize:
  • The number of TED talks increased significantly in 2009. This is also the year when TEDx talks started.
  • Most TED talks are held in Feb – Mar, the months when the annual TED conference is held. Most TED talks are held on Wednesdays and Thursdays.
  • Technology and science have always been common topics for TED talks. Since 2015, there have been more and more talks about innovation and society.
  • Audiences generally have positive perceptions of TED talks. Looking at the most popular TED talks, the top 10 most commented talks are viewed as inspiring, informative, fascinating and persuasive. The top 10 most viewed talks share similar ratings, and are also viewed as funny.
I have also built a recommender, which recommends the 10 talks that received the most of a specific rating, for example, the 10 most inspiring talks. Now I have some idea of what TED talks to watch next!
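
The recommender itself lives in the Tableau dashboard, but the ranking idea can be sketched in a few lines of Python. The function name, the data layout (one dictionary of rating counts per talk) and the sample talks below are all hypothetical, chosen only for illustration.

```python
def recommend(talks, rating, n=10):
    """Return the titles of the n talks with the highest count for a rating."""
    # Talks that never received the rating are treated as having a count of 0.
    ranked = sorted(talks, key=lambda t: t["ratings"].get(rating, 0), reverse=True)
    return [t["title"] for t in ranked[:n]]


# Hypothetical talks with made-up rating counts.
talks = [
    {"title": "Talk A", "ratings": {"Inspiring": 500, "Funny": 20}},
    {"title": "Talk B", "ratings": {"Inspiring": 120, "Funny": 900}},
    {"title": "Talk C", "ratings": {"Inspiring": 800}},
]

top_inspiring = recommend(talks, "Inspiring", n=2)
top_funny = recommend(talks, "Funny", n=1)
print(top_inspiring)
print(top_funny)
```

Ranking by raw counts favours talks with many views; dividing each count by the talk's total number of ratings would be one way to reduce that bias.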

Predicting the Survival of Titanic Passengers (Part 2)

In my previous blog post, we learned a bit about what affects the survival of Titanic passengers by conducting exploratory data analysis and visualizing the data. Then, the data was wrangled to prepare it for modelling. In this blog post, I will use machine learning algorithms available in Python’s Scikit-learn library to predict which passengers in the testing data survived. A Decision Tree Classifier is used as an example, and then its hyperparameters are tuned to see if this improves prediction accuracy. I’ll also try using an ensemble of models to predict the results.
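
As a taste of the workflow, here is a minimal sketch of the three steps (baseline decision tree, hyperparameter tuning, ensemble) in Scikit-learn. It runs on synthetic data from make_classification as a stand-in for the wrangled Titanic features, and the parameter grid and the choice of ensemble members are my own assumptions, not necessarily what the full post uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wrangled Titanic features (not the real data).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: baseline decision tree with default hyperparameters.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
baseline_acc = tree.score(X_test, y_test)

# Step 2: tune hyperparameters with 5-fold cross-validated grid search.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
tuned_acc = grid.score(X_test, y_test)

# Step 3: combine several models with majority (hard) voting.
ensemble = VotingClassifier(
    estimators=[
        ("tree", grid.best_estimator_),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
ensemble_acc = ensemble.score(X_test, y_test)

print("Baseline tree accuracy:", baseline_acc)
print("Tuned tree accuracy:", tuned_acc, "with", grid.best_params_)
print("Ensemble accuracy:", ensemble_acc)
```

Here the hold-out split carries labels so accuracy can be measured directly; Kaggle's Titanic testing data has no labels, so predictions on it can only be scored by submitting them.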

Continue reading

Learning more about ICOs

In this blog post, I would like to introduce the project we did at the 5th Unhackathon organized by Data Science Hong Kong. Our team’s project was to look at Initial Coin Offering (ICO) data extracted from ICObench to determine which ICOs are scams. Given that we only had a few hours to work on it at the Unhackathon, we focused on data wrangling and visualization to learn more about ICOs. After the Unhackathon, I spent some time conducting a simple analysis of a few additional features. As the data was insufficient to determine which ICOs are scams, I played around with the data to see if there are any patterns in what makes an ICO profitable.

To summarize, I found that –

  • There are a lot of outliers in terms of the return on investment. It is hard to predict which ICOs will be profitable, at least based on the data we got.
  • ICObench provides a rating for each ICO, which is calculated based on ICObench’s algorithm and ratings from “Experts” (certain groups of ICO users). These 2 metrics have rather different rating standards, and it seems that more weight is given to ICObench’s algorithm in determining the overall rating.
  • Based on our data, returns on investment do not seem to be strongly related to ratings. This result is also affected by the huge outliers.

Continue reading

Predicting the Survival of Titanic Passengers (Part 1)

This is a classic project for those who are starting out in machine learning. The aim is to predict which passengers survived the Titanic shipwreck. I will give this project a try using the training and testing data obtained from Kaggle.

Continue reading