ISYE6420 Bayesian Statistics

14 Dec 2020

This is the fifth course I’ve taken in my studies at Georgia Tech. Bayesian statistics is a topic I have always wanted to learn, and this semester I felt ready to take it on.

I finished the course with 95%, another A secured. The direct impact of this course is that I’m no longer afraid of the mathematical notation in papers, since the assignments required writing out tons of equations. The course videos only skim through the topics; their most valuable parts are the demos of WinBUGS and OpenBUGS. Most of the time I read the textbook - http://statbook.gatech.edu/index.html, which covers the topics in detail. In the end, I wrote most of the assignments in Matlab instead of WinBUGS, and it was surprisingly efficient to translate math equations from paper to code in Matlab.

I did hit one problem where WinBUGS was more efficient. In the final exam, we were asked to find influential observations or outliers in the sense of CPO (conditional predictive ordinate) or cumulative. But Matlab does not have a CPO function. To use WinBUGS on a Mac, I figured out how to use an AWS WorkSpaces virtual desktop, which works exactly like a normal Windows system. The only downside is speed: it lags a bit. After installing WinBUGS on this virtual Windows desktop, calculating CPO was much easier, with potential outliers defined as (CPO)i < 0.02.
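
For readers without WinBUGS, CPO can also be estimated directly from posterior draws as the harmonic mean of the likelihood of each observation. The following is a minimal sketch under stated assumptions: a normal likelihood, a handful of made-up posterior draws, and hypothetical helper names (`normal_pdf`, `cpo`, `flag_outliers`). It is not the exam solution or the WinBUGS implementation.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y (the likelihood assumed for this sketch)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def cpo(y_i, posterior_draws):
    """Harmonic-mean estimate: CPO_i = M / sum_m 1 / f(y_i | theta_m)."""
    inverse_likelihoods = [1.0 / normal_pdf(y_i, mu, sigma)
                           for mu, sigma in posterior_draws]
    return len(inverse_likelihoods) / sum(inverse_likelihoods)

def flag_outliers(ys, posterior_draws, threshold=0.02):
    """Indices of observations whose CPO falls below the threshold."""
    return [i for i, y in enumerate(ys) if cpo(y, posterior_draws) < threshold]

# Made-up (mu, sigma) draws standing in for real MCMC output
draws = [(0.0, 1.0), (0.1, 1.1), (-0.1, 0.9)]
observations = [0.2, -0.5, 4.0]
outlier_indices = flag_outliers(observations, draws)
```

With real MCMC output, `draws` would hold thousands of samples. An observation far in the tail can make the likelihood underflow to zero, so working with log-likelihoods would be safer there.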

Overall the course met my expectations, though I learned most of the material from reading the textbook. I hate to say this, but the lecture videos are dry. The textbook is much better, with detailed examples. The assignments are great for putting knowledge into practice. I couldn’t think of many real-world applications other than a Bayesian optimizer for hyperparameter tuning. But it helps, since most of my colleagues come from a statistics background: understanding topics like MCMC methods or Hidden Markov Models makes it easier to communicate with them.

Sentiment Analysis from Word2Vec to BERT - A Comprehensive Guide

25 Apr 2020

Overview
Dataset and Exploratory Data Analysis
Text Preprocessing
Word Embeddings Approach
LSTM-Based Approach
BERT-Based Approach
Model Evaluation and Comparison
Results and Performance Comparison
Conclusion

Sentiment analysis is one of the fundamental tasks in Natural Language Processing (NLP), with applications ranging from social media monitoring to customer feedback analysis. This comprehensive guide walks through different approaches to sentiment analysis, from traditional word embeddings to state-of-the-art transformer models, using the IMDB movie reviews dataset as our example.

Overview

We’ll explore various techniques for sentiment analysis, implementing each approach with practical code examples. Our journey will cover:

Data exploration and preprocessing
Traditional word embedding approaches
Advanced neural architectures
Modern transformer-based solutions

Let’s begin with loading and examining our dataset.

Dataset and Exploratory Data Analysis

The IMDB dataset contains 50,000 movie reviews split evenly between training and test sets, with balanced positive and negative sentiments. Let’s explore this data:

import pandas as pd
import numpy as np
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns

# Load IMDB dataset
dataset = load_dataset("imdb")
train_data = dataset["train"]
test_data = dataset["test"]

# Convert to pandas for easier analysis
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Basic statistics
print(f"Training set size: {len(train_df)}")
print(f"Test set size: {len(test_df)}")
print(f"\nLabel distribution:\n{train_df['label'].value_counts()}")

# Text length distribution
train_df['text_length'] = train_df['text'].str.len()

plt.figure(figsize=(10, 6))
sns.histplot(data=train_df, x='text_length', bins=50)
plt.title('Distribution of Review Lengths')
plt.xlabel('Length of Review')
plt.ylabel('Count')
plt.show()

This initial analysis reveals several important characteristics of our dataset:

25,000 training examples and 25,000 test examples
Perfectly balanced classes (50% positive, 50% negative)
Variable review lengths, with most reviews between 500 and 2500 characters

Text Preprocessing

Before applying any modeling technique, we need to clean and standardize our text data. Here’s a comprehensive preprocessing pipeline:

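# Note: the NLTK tokenizer, stopword, and WordNet resources must be
# downloaded once with nltk.download() before running this code
import re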
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

class TextPreprocessor:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
    
    def clean_text(self, text):
        # Convert to lowercase
        text = text.lower()
        
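        # Replace HTML tags (e.g. <br />) with a space so that the words
        # on either side of a tag are not fused together
        text = re.sub(r'<[^>]+>', ' ', text)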
        
        # Remove special characters and digits
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def process(self, text, remove_stopwords=True):
        # Clean text
        text = self.clean_text(text)
        
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove stopwords and lemmatize
        if remove_stopwords:
            tokens = [self.lemmatizer.lemmatize(token) 
                     for token in tokens 
                     if token not in self.stop_words]
        else:
            tokens = [self.lemmatizer.lemmatize(token) 
                     for token in tokens]
        
        return ' '.join(tokens)

# Preprocess the data
preprocessor = TextPreprocessor()
train_df['processed_text'] = train_df['text'].apply(preprocessor.process)

This preprocessing pipeline:

Converts text to lowercase
Removes HTML tags and special characters
Tokenizes the text
Removes stopwords (optional)
Lemmatizes words to their base form
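
To see what the cleaning steps do to a single review, here is a small standalone sketch that needs only the standard library. The function name `clean_review`, the sample sentence, and the tag pattern `<[^>]+>` are assumptions for illustration. Tokenization, stopword removal, and lemmatization are left out because they require NLTK data.

```python
import re

def clean_review(text):
    """Lowercase, strip HTML tags, drop non-letters, and collapse whitespace."""
    text = text.lower()
    # Replace tags with a space so neighbouring words stay separate
    text = re.sub(r'<[^>]+>', ' ', text)
    # Keep only letters and whitespace (digits and punctuation are dropped)
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Great movie!<br /><br />I'd watch it 10 times."
cleaned = clean_review(sample)
print(cleaned)
```

Note that removing the apostrophe turns "I'd" into "id", a side effect of this simple character filter.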

Word Embeddings Approach

Let’s build document features for sentiment analysis from Word2Vec embeddings with TF-IDF weighting:

from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# Prepare data for Word2Vec
tokenized_reviews = [review.split() for review in train_df['processed_text']]

# Train Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_reviews,
                    vector_size=100,
                    window=5,
                    min_count=5,
                    workers=4)

# Function to get word vectors
def get_word_vector(word):
    try:
        return w2v_model.wv[word]
    except KeyError:
        return np.zeros(100)  # Return zeros for OOV words

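# Create TF-IDF weighted Word2Vec
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(train_df['processed_text'])

def get_weighted_word_vectors(text):
    words = text.split()
    if not words:
        return np.zeros(100)  # Empty review: fall back to a zero vector
    word_vectors = np.array([get_word_vector(word) for word in words])
    # transform() returns one weight per vocabulary term, so look up each
    # word's weight by its column index (0 for words TF-IDF did not keep)
    doc_weights = tfidf.transform([text]).toarray()[0]
    tfidf_weights = np.array([
        doc_weights[tfidf.vocabulary_[word]] if word in tfidf.vocabulary_ else 0.0
        for word in words
    ])
    weighted_vectors = word_vectors * tfidf_weights[:, np.newaxis]
    return np.mean(weighted_vectors, axis=0)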
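
The weighted document vectors still need a classifier on top. A minimal sketch follows, assuming scikit-learn's `LogisticRegression` (chosen for illustration, not necessarily the classifier behind the accuracy reported later) and synthetic 100-dimensional features standing in for the real document vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for document vectors: two shifted Gaussian clusters
X_pos = rng.normal(loc=0.5, scale=1.0, size=(200, 100))
X_neg = rng.normal(loc=-0.5, scale=1.0, size=(200, 100))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

# Fit a linear classifier on the document vectors
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
print(f"Training accuracy on synthetic data: {train_accuracy:.2f}")
```

With the real data, `X` would be built by stacking `get_weighted_word_vectors(text)` for every processed review, and accuracy should be measured on the held-out test set rather than on the training data.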

LSTM-Based Approach

Next, let’s implement a more sophisticated approach using bidirectional LSTM:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Prepare data
MAX_WORDS = 10000
MAX_LEN = 200

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_df['processed_text'])

X_train = pad_sequences(
    tokenizer.texts_to_sequences(train_df['processed_text']),
    maxlen=MAX_LEN
)
y_train = train_df['label'].values

# Build LSTM model
def create_lstm_model(vocab_size, embedding_dim=100):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, 
                                input_length=MAX_LEN),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, 
                                    return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    return model

# Create and compile model
model = create_lstm_model(MAX_WORDS + 1)
model.compile(optimizer='adam',
             loss='binary_crossentropy',
             metrics=['accuracy'])

# Train model
history = model.fit(
    X_train, y_train,
    epochs=5,
    batch_size=32,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=2
        )
    ]
)
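
One detail worth knowing about the data preparation above: with its default settings, `pad_sequences` pads and truncates at the front of each sequence, so a review longer than `MAX_LEN` keeps its last 200 token ids. The sketch below mimics that default behaviour in plain Python. `pad_or_truncate` is a hypothetical helper for illustration, not part of Keras.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncating for one sequence.

    Assumes maxlen > 0. Keeps the last maxlen ids and left-pads with value.
    """
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short_example = pad_or_truncate([1, 2, 3], maxlen=5)
long_example = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_example)
print(long_example)
```

If the opening of a review matters more than its ending, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.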

BERT-Based Approach

Finally, let’s implement sentiment analysis using BERT, representing the current state-of-the-art:

from transformers import (
    BertTokenizer, 
    BertForSequenceClassification, 
    TrainingArguments, 
    Trainer
)
import torch
from torch.utils.data import Dataset

# Custom dataset class
class IMDBDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.encodings = tokenizer(texts, 
                                 truncation=True,
                                 padding=True,
                                 max_length=max_length)
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) 
                for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

# Create datasets
train_dataset = IMDBDataset(
    train_df['processed_text'].tolist(),
    train_df['label'].tolist(),
    tokenizer
)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()

Model Evaluation and Comparison

Let’s create a comprehensive evaluation framework:

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_curve,
    auc
)

def evaluate_model(y_true, y_pred, y_prob=None, model_name=""):
    # Print classification report
    print(f"\nClassification Report for {model_name}:")
    print(classification_report(y_true, y_pred))
    
    # Plot confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()
    
    # Plot ROC curve if probabilities are available
    if y_prob is not None:
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        roc_auc = auc(fpr, tpr)
        
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, color='darkorange', lw=2, 
                label=f'ROC curve (AUC = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title(f'ROC Curve - {model_name}')
        plt.legend(loc="lower right")
        plt.show()
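
`evaluate_model` expects hard labels in `y_pred` and positive-class probabilities in `y_prob`, while the models above produce sigmoid outputs (LSTM) or raw logits (BERT). The sketch below shows one way to convert both. The helper names `from_sigmoid` and `from_logits` and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

def from_sigmoid(probs, threshold=0.5):
    """Turn sigmoid outputs of shape (n, 1) into (labels, probabilities)."""
    probs = np.asarray(probs, dtype=float).ravel()
    return (probs >= threshold).astype(int), probs

def from_logits(logits):
    """Turn two-class logits of shape (n, 2) into (labels, P(class 1))."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    softmax = exp / exp.sum(axis=1, keepdims=True)
    return softmax.argmax(axis=1), softmax[:, 1]

lstm_pred, lstm_prob = from_sigmoid([[0.9], [0.2]])
bert_pred, bert_prob = from_logits([[2.0, 0.0], [0.0, 3.0]])
```

The resulting pairs can be passed on as `evaluate_model(y_true, lstm_pred, lstm_prob, "LSTM")`, where `y_true` holds the test labels.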

Results and Performance Comparison

After training and evaluating all models, here are the key findings:

Word2Vec + TF-IDF:
- Accuracy: ~86%
- Fast training and inference
- Lightweight model size
Bidirectional LSTM:
- Accuracy: ~89%
- Better handling of long-range dependencies
- Moderate training time
BERT:
- Accuracy: ~93%
- Best overall performance
- Longest training time and largest model size

Conclusion

Our journey through different sentiment analysis approaches reveals several key insights:

Model Selection Trade-offs:
- Simple word embedding approaches provide a good baseline with minimal computational requirements
- LSTM models offer a good balance of performance and complexity
- BERT achieves the best results but requires significant computational resources
Practical Considerations:
- For production systems, consider the trade-off between accuracy and inference time
- BERT’s superior performance might be worth the computational cost for accuracy-critical applications
- For real-time applications with limited resources, LSTM or even Word2Vec approaches might be more appropriate
Future Directions:
- Explore domain-specific pre-training
- Investigate lightweight transformer architectures
- Consider multi-task learning approaches

The choice of model should ultimately depend on your specific use case, taking into account factors like accuracy requirements, computational resources, and latency constraints.

Trading Book

26 Feb 2020

Final Result Comparison

Project Link

1.Background

I spent the last two days building an algorithmic trading starter notebook. It takes a different approach from the one we currently use in the FX trading project. I would like to have it as an alternative starting point and compare the performance of the two approaches. To make it easier for others to compare, I followed the poor data-scientist engineering practice of committing data into the repo. The main techniques come from the course Machine Learning for Trading, taught by Tucker Balch, who has since left Georgia Tech to work for JP Morgan.

1.1 AWS setup

I would like to recommend my favourite setup: the AWS Deep Learning AMI (Google Cloud and Azure are mostly the same in my experience). A normal p2.xlarge would be more than sufficient. If you prefer to work with Jupyter notebooks, Fast.ai has awesome documentation on the setup.

1.2 Alternative setup

Another highly recommended tool is Google Colab. I almost always use it for experiments. The only catch is that it takes a bit of setup to load data from Google Drive. This post showed how to connect Colab to Gdrive.

2.Method

Financial data are normally time series, so sequential models like LSTM are a natural choice, as in our internal project. In this notebook, however, we embed the time-series information into technical indicators: for each day, apart from the price, the input includes several technical indicators that summarize historical information. This way, we can train on our dataset with frameworks like gradient boosted trees or Q-learning.
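
To make the idea concrete, here is a minimal pandas sketch of turning a price series into per-day features. The specific indicators (price/SMA ratio, a Bollinger-band value, and momentum), the 20-day default window, and the function name `add_indicators` are assumptions for illustration. The notebook may use a different set.

```python
import pandas as pd

def add_indicators(prices: pd.Series, window: int = 20) -> pd.DataFrame:
    """Build per-day features that each summarize the trailing price window."""
    sma = prices.rolling(window).mean()   # simple moving average
    std = prices.rolling(window).std()    # rolling volatility
    return pd.DataFrame({
        "price": prices,
        "price_sma_ratio": prices / sma,
        # Position inside the Bollinger bands: -1 at lower band, +1 at upper
        "bollinger_value": (prices - sma) / (2 * std),
        # Relative price change over the window
        "momentum": prices / prices.shift(window) - 1,
    })

# Toy upward-trending prices, for illustration only
toy_prices = pd.Series([float(p) for p in range(1, 31)])
features = add_indicators(toy_prices, window=5)
print(features.tail())
```

Because each row already carries its own history, the rows can be fed to a non-sequential learner as independent samples. The first `window` rows contain NaN values and need to be dropped before training.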

2.1 Assumption

We assume the Efficient Market Hypothesis does not hold, or at least that its semi-strong and strong forms do not hold. This should be common sense for quantitative hedge funds like Renaissance Technologies. There should be some signals or correlations in stock prices, though not for every stock, and we need methods to find them.

2.2 Pipeline Demo

The processing pipeline is shown in the README.md.

The model's target is one of three positions: HOLD, BUY, or SELL. Each day we have price information about one stock, along with selected technical indicators containing historical information. We train the model to learn which position to take each day; from the positions we derive the holdings, and from the daily holdings we calculate the orders to place. The final product is an order book of the days on which we BUY or SELL particular stocks.
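
The holdings-to-orders step amounts to taking day-over-day differences. Below is a minimal sketch, assuming holdings are expressed as share counts (negative for a short position) and that we start with no position. The function name and the sample dates are hypothetical.

```python
def holdings_to_orders(dates, holdings):
    """Convert daily target holdings into a list of (date, side, shares) orders."""
    orders = []
    previous = 0  # assumption: the portfolio starts with no position
    for date, target in zip(dates, holdings):
        delta = target - previous
        if delta > 0:
            orders.append((date, "BUY", delta))
        elif delta < 0:
            orders.append((date, "SELL", -delta))
        # delta == 0 means HOLD, so no order is placed that day
        previous = target
    return orders

dates = ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
holdings = [0, 1000, 1000, -1000]
orders = holdings_to_orders(dates, holdings)
print(orders)
```

Going from long 1000 to short 1000 produces a single SELL of 2000 shares, which is why orders are derived from holdings rather than directly from the daily positions.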

2.2.1 Backtesting

The starting point of backtesting is the orders file. We should treat backtesting separately, as it is probably the most important part of the whole pipeline. We need to make sure that the backtesting and forward-testing results are similar. This is a crucial point, but it is outside the scope of this post; this notebook serves as a starting point for exploration.

3.Result

The experiment results without too much fine-tuning are shared in the notebook.

In the experiment, the ML model performed much better, but I set the risk-free rate to 0 and the market impact to the minimum. There are many more concerns about the market environment. So to make sure the model would perform well in the real market, we need to spend extra effort fine-tuning the backtesting model.
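
To show where those two settings enter a backtest, here is a rough sketch. It assumes performance is summarized by an annualized Sharpe ratio and that market impact is modelled as a fixed fraction moving the fill price against the trader. Both are common conventions, not necessarily what the notebook does, and the function names are hypothetical.

```python
import math
import statistics

def sharpe_ratio(daily_returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from daily returns and a daily risk-free rate."""
    excess = [r - risk_free_rate for r in daily_returns]
    return (math.sqrt(periods_per_year)
            * statistics.mean(excess) / statistics.stdev(excess))

def fill_price(price, side, impact=0.0):
    """Execution price: impact raises the price for BUY and lowers it for SELL."""
    return price * (1 + impact) if side == "BUY" else price * (1 - impact)

toy_returns = [0.01, 0.02, 0.03]  # made-up daily returns
sharpe = sharpe_ratio(toy_returns)
buy_fill = fill_price(100.0, "BUY", impact=0.005)
sell_fill = fill_price(100.0, "SELL", impact=0.005)
```

Raising `risk_free_rate` or `impact` above zero lowers the measured performance, which is one reason results under the optimistic settings should be read with caution.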

4.Future work

There are several things I would like to try out to make this starter notebook more robust.

Use deep reinforcement learning approach.
Use more mature frameworks like LightGBM, and process with more data.
Try stacking and other ensembling methods.
Integrate with news data.
Apply to Two Sigma’s Kaggle competition.

CS6300 Software Development Process

01 Jan 2020

Course Link

First of all, this is an easy course, even for people like me with no prior Java experience. My final grade should be around 97% (A).

The course itself is great and covers many key concepts in software engineering. It is even worth revisiting after the semester ends. Syllabus shown here.

Also, I learned how to use IntelliJ and Android Studio for the semester projects. That’s a nice start, since I have always wanted to embed tensorflow.js into an Android app to build a simple ML-based app. The most interesting experience I got from this course was working remotely with three teammates from the States. It was definitely a culture shock: basically the whole project was done asynchronously. Now I understand that, with a well-developed workflow, it’s totally possible to work remotely. Cutting unnecessary meetings may not be a bad way to boost productivity.

One more thing, which might be a common problem for this master's program: I need to consolidate information from all channels, including Slack, Piazza, and YouTube for relevant videos. The lectures themselves are very interesting, but the assignments often take some extra effort.

Finally, these are the recommended resources for this course:

Git
- https://www.atlassian.com/git/tutorials
- https://github.com/progit/progit2
Java
- https://www.codecademy.com/learn/learn-java
- https://www.guru99.com/java-tutorial.html
IntelliJ
- https://www.youtube.com/watch?v=Bld3644bIAo
- https://www.youtube.com/watch?v=c0efB_CKOYo&list=PLPZy-hmwOdEXdOtXdFzyx_XCnrF_oD2Ft
Android Studio
- https://www.youtube.com/watch?v=g9YblXBQ5uU&t=11s
- https://www.youtube.com/watch?v=dFlPARW5IX8&t=694s
Flow Chart
- http://lucidchart.com/

CS6476 Computer Vision

06 Dec 2019

Course Link

Firstly, this course is very intense, packing many topics into one semester. In particular, the weekly assignments are all hard problems, which take up the majority of the study time. Ideally, one would go through all the materials before the semester begins.

In the end, I scored 89.27% for this course, almost the highest B one can get. I don’t really care about getting straight As, but this is a bit unfortunate since I was only about 0.7% away.

The big lesson learnt is that I should trust reviews and other people's experiences. I had been warned multiple times, from all channels, that the CNN project is a monster, but I still went with it since I thought I knew the topic well even before this course. Reality was different: the final project was very demanding, and my past CNN experience was almost irrelevant. Had I chosen EAR or MHI, it would definitely have been much easier.

Intro & Syllabus

Fortunately, I have a similar OCR project at work this semester, so I can apply some of the methods there. This could be the best-case scenario for a part-time Master's student.


Nan
