ISYE6420 Bayesian Statistics

Course Link

This is the 5th course I’ve taken in my studies at Georgia Tech. Bayesian statistics is a topic I have always wanted to learn, and this semester I felt ready to take on the journey.

I finished the course with 95%, another A secured. The direct impact of this course is that I’m no longer afraid of the mathematical notation in papers, since the assignments require writing tons of equations. The lecture videos only quickly skim through the topics; the valuable parts are the demos of WinBUGS and OpenBUGS. Most of the time I was reading the textbook - http://statbook.gatech.edu/index.html - which elaborates on the topics in detail. In the end, I wrote most of the assignments in Matlab instead of WinBUGS, and it is surprisingly efficient to translate equations from paper into code that way.
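As an illustration of how direct that paper-to-code translation is, here is a conjugate Beta-Binomial posterior in Python (a generic textbook example with made-up numbers, not an actual assignment):

```python
from scipy import stats

# Beta-Binomial model: theta ~ Beta(a, b), y | theta ~ Binomial(n, theta).
# Conjugacy gives the posterior in closed form: Beta(a + y, b + n - y).
a, b = 2, 2      # prior hyperparameters (illustrative)
n, y = 20, 14    # number of trials and observed successes (illustrative)

posterior = stats.beta(a + y, b + n - y)
print(posterior.mean())          # posterior mean, (a + y) / (a + b + n)
print(posterior.interval(0.95))  # 95% equal-tailed credible interval
```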

I did run into one problem where WinBUGS was the more efficient tool. In the final exam, we were asked to find influential observations or outliers in the sense of CPO (conditional predictive ordinate), but Matlab does not have a built-in CPO function. To use WinBUGS on a Mac, I figured out how to use Amazon WorkSpaces. It is a virtual desktop image and works exactly like a normal Windows system; the only downside is speed, as it lags a bit. After installing WinBUGS on this virtual Windows desktop, calculating CPO was much easier, and potential outliers are defined as CPO_i < 0.02.
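For reference, CPO_i is the leave-one-out predictive density f(y_i | y_{-i}), and it can be estimated from MCMC output as a harmonic mean of the per-draw likelihoods. Here is a minimal sketch in Python rather than WinBUGS, assuming a normal likelihood and hypothetical variable names:

```python
import numpy as np
from scipy import stats

def cpo(y, mu_samples, sigma_samples):
    """Harmonic-mean CPO estimate: CPO_i ~= M / sum_m (1 / f(y_i | theta_m)).
    y, mu_samples, sigma_samples are numpy arrays; the normal likelihood
    is purely for illustration."""
    # Likelihood of each observation under each posterior draw: shape (M, n).
    lik = stats.norm.pdf(y[None, :],
                         loc=mu_samples[:, None],
                         scale=sigma_samples[:, None])
    return lik.shape[0] / np.sum(1.0 / lik, axis=0)

# Flag potential outliers with the rule of thumb from the exam:
# outliers = np.where(cpo(y, mu_samples, sigma_samples) < 0.02)[0]
```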

Overall the course met my expectations, though I learned most of the material from reading the textbook. I hate to say this, but the lecture videos are so dry; the textbook is much better, with detailed examples. The assignments are great for putting knowledge into practice. I couldn’t think of many real-world applications other than a Bayesian optimizer for hyperparameter tuning. Still, the course helps, since most of my colleagues come from a statistics background, and understanding topics like MCMC methods or Hidden Markov Models makes it easier to communicate with them.

Trading Book

Final Result Comparison

Project Link

1. Background

I spent the last two days building an algorithmic trading starter notebook. It takes an essentially different approach from what we are doing now in the FX trading project, and I would like to have it as an alternative starting point so we can compare the performance of the two approaches. To make it easier for others to compare, I followed the data scientist’s poor engineering practice of committing data into the repo. The main techniques come from the course Machine Learning for Trading, taught by Tucker Balch, who has since left Georgia Tech to work for JP Morgan.

1.1 AWS setup

I would like to recommend my favourite setup: the AWS Deep Learning AMI (Google Cloud and Azure are much the same, in my experience). A normal p2.xlarge instance is more than sufficient. If you prefer to work with Jupyter notebooks, Fast.ai has awesome documentation about the setup.

1.2 Alternative setup

Another highly recommended tool is Google’s Colab. I almost always use it for experiments. The only thing is that it takes a bit of setup to read data from Google Drive. This post showed how to connect Colab to Google Drive.
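The core of it is a single call to Colab’s built-in helper (the CSV path in the comment is just a hypothetical example):

```python
# Mount Google Drive inside a Colab runtime (prompts for authorization).
from google.colab import drive

drive.mount('/content/drive')

# After mounting, files are available under the drive folder,
# e.g. (hypothetical path):
# df = pd.read_csv('/content/drive/My Drive/data/prices.csv')
```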

2. Method

Financial data are normally time series, so sequential models like LSTMs are a natural choice, just like the one we used in our internal project. But in this notebook, we embed the time-series information into technical indicators: for each day, apart from the price, there are several technical indicators carrying historical information as part of the input. In this way, we can train on the dataset with frameworks like gradient boosted trees or Q-learning.
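For illustration, such indicators can be built with pandas rolling windows; the particular indicator set and names below are my assumptions, not necessarily the notebook’s:

```python
import pandas as pd

def add_indicators(prices: pd.Series, window: int = 20) -> pd.DataFrame:
    """Embed historical information into per-day features."""
    sma = prices.rolling(window).mean()   # simple moving average
    std = prices.rolling(window).std()    # rolling volatility
    return pd.DataFrame({
        'price': prices,
        'price_to_sma': prices / sma - 1.0,         # distance from trend
        'bollinger': (prices - sma) / (2.0 * std),  # position in Bollinger bands
        'momentum': prices / prices.shift(window) - 1.0,
    }).dropna()
```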

2.1 Assumption

We assume that the Efficient Market Hypothesis does not hold, or at least that its semi-strong and strong forms do not hold. This should be common sense for a quantitative trading hedge fund like Renaissance Technologies: there should be some signals or correlations in stock prices, though not in all of them, and we need methods to find them.

2.2 Pipeline Demo

The processing pipeline is shown in the README.md.

The model’s target is one of three positions: HOLD, BUY and SELL. Each day we have price information about one stock, along with selected technical indicators containing historical information. We train the model to decide the position to take each day; based on the positions, we can derive the holdings, and from the daily holdings we calculate the orders we should make. Eventually, our final product is an order book of the days on which we BUY or SELL particular stocks.
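A minimal sketch of that positions → holdings → orders step; the fixed 1000-share position size, the symbol, and the column names are assumptions for illustration:

```python
import pandas as pd

def positions_to_orders(positions: pd.Series, symbol: str = 'JPM') -> pd.DataFrame:
    """positions: daily 'BUY'/'HOLD'/'SELL' labels indexed by date."""
    # BUY -> long 1000 shares, SELL -> short 1000, HOLD -> keep previous holding.
    holdings = positions.map({'BUY': 1000, 'SELL': -1000}).ffill().fillna(0)
    trades = holdings.diff().fillna(holdings)  # change in holdings = order size
    orders = trades[trades != 0].to_frame('Shares')
    orders['Symbol'] = symbol
    orders['Order'] = orders['Shares'].map(lambda s: 'BUY' if s > 0 else 'SELL')
    orders['Shares'] = orders['Shares'].abs().astype(int)
    return orders[['Symbol', 'Order', 'Shares']]
```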

2.2.1 Backtesting

The starting point of backtesting is the orders file. We should treat backtesting separately; it is probably the most important part of the whole pipeline. What we need to make sure of is that the backtesting result and the forward-testing result are similar. This is a crucial point, but it is beyond the scope of this post; this notebook serves as a starting point for exploration.
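To make the idea concrete, here is a bare-bones simulator that replays an orders file against daily prices; it deliberately ignores commissions and market impact, and all names are hypothetical:

```python
import pandas as pd

def backtest(orders: pd.DataFrame, prices: pd.Series,
             start_cash: float = 100_000.0) -> pd.Series:
    """Replay orders (Symbol/Order/Shares indexed by date) for one symbol
    against a daily price series; return the daily portfolio value."""
    shares, cash, values = 0, start_cash, {}
    for date, price in prices.items():
        if date in orders.index:
            row = orders.loc[date]
            signed = row['Shares'] if row['Order'] == 'BUY' else -row['Shares']
            shares += signed          # update the position
            cash -= signed * price    # pay (or receive) the trade value
        values[date] = cash + shares * price
    return pd.Series(values)
```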

3. Result

The experimental results, without much fine-tuning, are shared in the notebook.

In the experiment, the ML model performed much better, but I set the risk-free rate to 0 and market impact to the minimum. There are many more concerns about the market environment, so to make sure the model would perform well in the real market, we need to spend extra effort fine-tuning the backtesting model.
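The metric behind that comparison is essentially the Sharpe ratio, which is where the risk-free rate enters; a sketch with hypothetical names, using the daily portfolio values from the simulator above:

```python
import numpy as np
import pandas as pd

def sharpe_ratio(port_vals: pd.Series, risk_free_rate: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a daily portfolio-value series.
    I used risk_free_rate=0 above; a realistic comparison should plug
    in an actual rate and model transaction costs."""
    daily = port_vals.pct_change().dropna()           # daily returns
    excess = daily - risk_free_rate / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()
```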

4. Future Work

There are several things I would like to try to make this starter notebook more robust.

  • Use a deep reinforcement learning approach.
  • Use more mature frameworks like LightGBM, and train with more data (see the sketch after this list).
  • Try stacking and other ensembling methods.
  • Integrate with news data.
  • Apply it to Two Sigma’s Kaggle competition.
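Here is a minimal sketch of the LightGBM idea from the list above, with dummy stand-in data; in reality the feature matrix would come from the technical indicators, and every name here is hypothetical:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real features (e.g. the indicators) and labels.
X = np.random.rand(500, 4)
y = np.random.choice(['BUY', 'HOLD', 'SELL'], size=500)

# Keep the time order: never shuffle a financial time series into the split.
X_train, X_val, y_train, y_val = train_test_split(X, y, shuffle=False)

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
positions = model.predict(X_val)  # predicted BUY/HOLD/SELL per day
```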

CS6300 Software Development Process

Course Link

First of all, this is an easy course, even for people like me with no prior Java experience. My final grade should be around 97% (A).

The course itself is great, covering many key concepts in software engineering. It is even worth revisiting after the semester ends. The syllabus is shown here.

Also, I learned how to use IntelliJ and Android Studio for the semester projects. That’s a nice start, since I have always wanted to embed TensorFlow.js into an Android app to build a simple ML-based app. The most interesting experience I got from this course was working remotely with three other teammates from the States. It was definitely a culture shock: basically the whole project was done asynchronously. Now I understand that with a well-developed workflow, it is totally possible to work remotely, and cutting unnecessary meetings may not be a bad idea for boosting productivity.

One caveat, which might be a common problem across this master’s program: I need to consolidate information from many channels - Slack, Piazza, YouTube for relevant videos, and so on. The lectures themselves are very interesting, but the assignments often take some extra effort.

Finally, here are the recommended resources for this course:

  • Git
    • https://www.atlassian.com/git/tutorials
    • https://github.com/progit/progit2
  • Java
    • https://www.codecademy.com/learn/learn-java
    • https://www.guru99.com/java-tutorial.html
  • IntelliJ
    • https://www.youtube.com/watch?v=Bld3644bIAo
    • https://www.youtube.com/watch?v=c0efB_CKOYo&list=PLPZy-hmwOdEXdOtXdFzyx_XCnrF_oD2Ft
  • Android Studio
    • https://www.youtube.com/watch?v=g9YblXBQ5uU&t=11s
    • https://www.youtube.com/watch?v=dFlPARW5IX8&t=694s
  • Flow Chart
    • http://lucidchart.com/

CS6476 Computer Vision

Course Link

First of all, this course is very intense, packing many topics into one semester. In particular, the weekly assignments are all hard problems that take up the majority of the study time. Ideally, one would go through all the materials before the semester begins.

Eventually, I scored 89.27% in this course, almost the highest B one can get. I don’t really care about getting straight As, but this is a bit unfortunate since I was only 0.7% away.

The big lesson learned is that I should trust reviews and others’ experience. I was warned multiple times, through all channels, that the CNN project is a monster, but I still went with it since I thought I knew the topic well even before this course. The reality was different: the final project is just so demanding, and my past CNN experience was almost irrelevant. Had I chosen EAR or MHI, it would definitely have been much easier.

Intro & Syllabus

Fortunately, I had a similar OCR project to do at work this semester, so some of the course’s methods applied directly to my work. This could be the best-case scenario for a part-time master’s.

Almost Year End

I have the feeling that, especially in my master’s program, no matter how hard something seems at the very beginning, once you start making progress it no longer feels the same.

This is my very first try at blogging, so instead of a clear story about one particular topic, it is more of a summary of my progress.

Work - OCBC AI Lab

Last year, when I was attending Nuno’s class on Negotiation, one of the projects was “10 No’s”: you have to collect 10 rejections for the same ask, and Nuno said you would soon find out how hard it is to be repeatedly rejected. That project eventually landed me in OCBC AI Lab.

I didn’t take a break: I left Wego on 31st July and joined OCBC three days later. There are pros and cons between the two, and you can never know them beforehand. Nowadays I can focus on machine learning projects without bothering with BI work like building dashboards, but I no longer have the privilege of doing everything on the cloud.

Projects in the AI Lab do not progress linearly. It’s more like you keep trying until you find all the dead ends, then restart again. It’s quite an experience; I guess that’s why they need to hire PhDs, who are used to this type of back-and-forth progress. The key is to keep trying.

These are the projects I’m currently working on:

| Project | Difficulty | Start Date | Progress | Note |
| --- | --- | --- | --- | --- |
| Enterprise Chatbot | Hard | Aug’18 | Deployment | Luis.ai & DialogFlow |
| Card Fraud | Hard | Jul’19 | Phase 2 | Anomaly Detection |
| Cheque OCR | Hard | Sep’19 | POC | CNN + RNN |

Study - Georgia Tech - M.S. CS

During my last days at Wego, apart from Nuno’s class, I also got my GMAT score (700). It wasn’t that high, but it was enough to start applying to some schools. In the end, I didn’t apply for an MBA program but for several part-time master’s programs instead. I chose the Master of Science in Computer Science from the Georgia Institute of Technology, because other programs seem to care more about charging high fees; OMSCS seems to be the most affordable choice, while the quality looks great from reviews.

The specialization I decided on is Machine Learning, a natural choice since I’ve been doing it for years. I will elaborate on the details of each course in future posts, but here is the general plan:

| Course | Difficulty | Semester |
| --- | --- | --- |
| CS7638 Artificial Intelligence for Robotics | Hard | 19’Spring |
| CS7646 Machine Learning for Trading | Medium | 19’Summer |
| CS6476 Computer Vision | Hard | 19’Fall |
| CS6300 Software Development Process | Easy | 19’Fall |

Side tracks

I finished most of the interesting specializations on Coursera while I was with Wego - a bit too many, looking back now. This year I stopped Coursera since I’m already on my master’s degree journey. However, I did go back to school to take CS6101 for two consecutive semesters. They were really difficult courses.

  • First semester (18’Fall) - NLP, following Stanford’s CS224N
  • Second semester (19’Spring) - Deep Reinforcement Learning, following UC Berkeley’s CS285
  • Subsequently (19’Summer), I was granted CES to take Udacity’s Deep Learning Nanodegree

Others

I’m still making time for the gym and exercise. However, since I injured my ankle, I’ve skipped most of my regular soccer and basketball games.

This year I decided to finally pick up my piano classes again. Conveniently, there is a piano school 50m from my place. Though the teacher asked me to try for ABRSM Grade 7, I felt too tired to make another commitment. Nothing stays enjoyable once you put an exam on it.

These are the two pieces I would like to master next year.

  • Beethoven: Piano Sonata No. 8 in C minor, Op. 13, III. Rondo: Allegro
  • Bach: Prelude and Fugue in C minor, BWV 847

And by the very end of next year, hopefully, I can start these two:

  • Chopin: Fantaisie-impromptu in C-Sharp Minor, Op. 66
  • Beethoven: Piano Sonata No. 14 in C♯ minor “Quasi una fantasia”, Op. 27, No. 2

I believe that now and then you should stop and think about long-term goals. Next year, my top priority will still be work and study. But apart from that, instead of more sidetrack studies, I would like to get some Kaggle exposure and do some tech sharing. Using Medium to replicate SOTA results would be a good start; eventually, I’d also like to try for a publication. I may need some help from a professor on this part.

Data science is an exhausting journey; I believe the best way through is to keep it fun.

Satisfaction = Perception - Expectation

Travel

I need to list it separately since I’ve been spending too much on this category.

Travel has always played a big part in my life. It is such a beautiful world that one needs to see it with one’s own eyes.

| Time | Places | Sub-places | Score |
| --- | --- | --- | --- |
| Dec’18 | China | Beijing, Pingyao, Xi’An | ⭐⭐ |
| Jan’19 | China | Dalian | ⭐⭐⭐⭐⭐ |
| May’19 | Japan | Kobe, Himeji, Kyoto, Osaka | ⭐⭐⭐⭐ |
| Aug’19 | Central Europe | Germany, Czech, Slovenia, Croatia, Montenegro, Hungary, Slovakia | ⭐⭐⭐⭐⭐ |
| Oct’19 | Myanmar | Yangon, Bagan | ⭐⭐⭐⭐ |
| Dec’19 | China, Korea | Shenzhen, Chengdu, Chongqing, Ningbo, Seoul, Dalian | ⭐⭐⭐ |