This article shares my experience in the market prediction competition organized by CrunchDao. The competition has ended, and its description can be found here: https://www.crunchdao.com/live/adialab
In short, competitors must rank the performance of each asset in a pool on a daily basis. The Spearman correlation coefficient between a competitor's predicted ranks and the actual ranks after market close is used to evaluate results weekly, for as long as 3 months after the competition ends. The competitors with the highest Spearman correlation coefficient win.
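To make the metric concrete, here is a minimal sketch of a per-date Spearman score, computed as the Pearson correlation of ranks. The column names `date`, `y_true` and `y_pred` and the toy values are assumptions for illustration, and averaging the daily coefficients is a simplification, not necessarily how CrunchDao aggregates its official score.

```python
import pandas as pd

def daily_spearman(df: pd.DataFrame) -> pd.Series:
    """Spearman correlation per date (Pearson correlation of the ranks)."""
    scores = {}
    for date, day in df.groupby("date"):
        scores[date] = day["y_true"].rank().corr(day["y_pred"].rank())
    return pd.Series(scores)

# Toy data: perfect ordering on date 1, fully reversed ordering on date 2
toy = pd.DataFrame({
    "date":   [1, 1, 1, 2, 2, 2],
    "y_true": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y_pred": [0.1, 0.2, 0.3, 0.3, 0.2, 0.1],
})
scores = daily_spearman(toy)
mean_score = scores.mean()
print(scores)
```

Because only ranks enter the calculation, any monotonic rescaling of the predictions leaves the score unchanged.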
As I have a strong interest in predicting stock market movements, I took up the challenge, hoping to sharpen my data science skills along the way.
After 12 weeks of out-of-sample testing, I ranked 261 out of 4484 competitors, with an overall score of 3.762.
The training dataset was provided by CrunchDao. X_train has a total of 463 columns: date, asset id, and 461 anonymized features for each asset id on a particular date. The feature names are kept confidential by CrunchDao.
The label to be predicted is y_train. As you can see from the image below, y is the actual performance rank of each asset on a particular date.
After some data exploration, I noticed that the number of assets is not constant across all dates.
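A check along the following lines reveals this. It assumes the asset identifier column is called `id`, as in the code later in this article; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy stand-in for X_train: three assets on date 0, two assets on date 1
toy = pd.DataFrame({"date": [0, 0, 0, 1, 1], "id": ["a", "b", "c", "a", "c"]})

# Number of distinct assets observed on each date
assets_per_date = toy.groupby("date")["id"].nunique()
print(assets_per_date.describe())

# True only if every date has the same number of assets
is_constant = assets_per_date.nunique() == 1
```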
Before training, feature engineering was performed to create more useful features: the average of all features in each row was calculated and added as a new column named mean.
import pandas as pd

def create_newfeatures(X_train):
    # The asset id is an identifier, not a feature
    X_train = X_train.drop(['id'], axis=1)
    # Create the statistics feature: row-wise mean of all features (date excluded)
    X_train_filter = X_train.drop(['date'], axis=1)
    X_train_filter['mean'] = X_train_filter.mean(axis=1)
    X_train_filter = X_train_filter[['mean']]
    new_X_train = pd.concat([X_train, X_train_filter], axis=1)
    # Replace missing values with 0
    new_X_train = new_X_train.fillna(0)
    return new_X_train
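One detail is worth seeing on toy data: `DataFrame.mean(axis=1)` skips missing values by default, so the row mean is taken over the features that are present, before `fillna(0)` runs. The sketch below uses a made-up three-row frame with hypothetical feature names `f1` and `f2`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": [0, 0, 1],
    "id": ["a", "b", "a"],
    "f1": [1.0, np.nan, 3.0],
    "f2": [3.0, 4.0, 5.0],
})

features = toy.drop(columns=["id", "date"])
row_mean = features.mean(axis=1)  # NaN entries are ignored (skipna=True)

# Same assembly order as in the article: concat first, then fill NaN with 0
out = pd.concat([toy.drop(columns=["id"]), row_mean.rename("mean")], axis=1).fillna(0)
print(out)
```

The second row has a missing `f1`, so its mean is simply the value of `f2`; only afterwards is the missing entry itself replaced by 0.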
After feature engineering, feature selection was performed to narrow the 462 features (the 461 original features plus the new mean) down to the important ones, with the aim of improving model performance. A random forest regressor was trained on the training portion of a temporal split of the dataset. Features with an importance larger than 0.002 were selected for the actual training.
from pathlib import Path
from sklearn.ensemble import RandomForestRegressor

def get_important_features(new_X_train: pd.DataFrame, X_train: pd.DataFrame, y_train: pd.DataFrame, model_directory_path: str = "resources"):
    print("Splitting (X_train, y_train) in X_train_local, X_test_local, y_train_local, y_test_local")
    X_train_1, X_test_1, y_train_1, y_test_1 = temporal_train_test_split(
        new_X_train,
        y_train,
        test_size=0.2
    )
    # Keep only feature columns in X and only the target column in y
    X_train_1 = X_train_1.drop(['date'], axis=1)
    X_test_1 = X_test_1.drop(['date'], axis=1)
    y_train_1 = y_train_1.drop(['date', 'id'], axis=1)
    y_test_1 = y_test_1.drop(['date', 'id'], axis=1)
    # First we build and train our Random Forest model
    rf = RandomForestRegressor(max_depth=10, n_estimators=2, verbose=3)
    # ravel() passes y as a 1-D array, which is the shape scikit-learn expects
    rf.fit(X_train_1, y_train_1.values.ravel())
    importances = rf.feature_importances_
    df_importances = pd.DataFrame(importances, columns=['importance'])
    df_importances['feature'] = X_train_1.columns.values.tolist()
    # Keep features whose importance exceeds the 0.002 threshold
    df_importances = df_importances.loc[df_importances['importance'] > 0.002]
    result = df_importances['feature'].tolist()
    important_feature_pathname = Path(model_directory_path) / "result.csv"
    df_importances.to_csv(important_feature_pathname)
    X_train_date = X_train[['date']]
    # Filter out the non-important features/columns, then put the date back in front
    X_train_important = new_X_train[new_X_train.columns.intersection(result)]
    X_train_important = pd.concat([X_train_date, X_train_important], axis=1)
    return X_train_important
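To illustrate the thresholding step in isolation, here is a small sketch on synthetic data in which only one feature carries signal. The 0.002 threshold comes from the article; the feature names, the 50 trees and the fixed `random_state` are assumptions chosen so that the toy run is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"f{i}" for i in range(5)])
y = 3.0 * X["f0"] + 0.1 * rng.normal(size=500)  # only f0 is informative

rf = RandomForestRegressor(max_depth=10, n_estimators=50, random_state=0)
rf.fit(X, y)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(rf.feature_importances_, index=X.columns)
selected = importances[importances > 0.002].index.tolist()
print(importances.sort_values(ascending=False))
```

With very few trees the importances can shift noticeably between runs, so a larger forest or a fixed seed may give a more stable selection.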
After feature engineering and selection, the training dataset (X_train and y_train) needs to be further split into a train set and a test set. I used a test size of 20%. A custom function, shown below, splits the data sequentially by date. This preserves the temporal order of the data, which a random split would not.
def temporal_train_test_split(X_train_loc, y_train_loc, test_size=0.2):
    unique_dates = X_train_loc.date.unique()
    split_date = unique_dates[int(len(unique_dates)*(1-test_size))]
    X_train_local = X_train_loc[X_train_loc['date'] <= split_date]
    X_test_local = X_train_loc[X_train_loc['date'] > split_date]
    y_train_local = y_train_loc[y_train_loc['date'] <= split_date]
    y_test_local = y_train_loc[y_train_loc['date'] > split_date]
    return X_train_local, X_test_local, y_train_local, y_test_local

print("Splitting (X_train, y_train) in X_train_local, X_test_local, y_train_local, y_test_local")
X_train_local, X_test_local, y_train_local, y_test_local = temporal_train_test_split(
    X_train,
    y_train,
    test_size=0.2
)
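The function relies on `unique()` returning the dates in chronological order, which holds only if the frame is already sorted by date. A variant that sorts the dates first is sketched below; the name `split_by_date` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_date(X, y, test_size=0.2):
    """Chronological split: earliest dates go to train, latest dates to test."""
    dates = np.sort(X["date"].unique())
    split_date = dates[int(len(dates) * (1 - test_size))]
    train_mask_X = X["date"] <= split_date
    train_mask_y = y["date"] <= split_date
    return X[train_mask_X], X[~train_mask_X], y[train_mask_y], y[~train_mask_y]

# Toy data: 10 dates with 2 assets each, deliberately shuffled
X = pd.DataFrame({"date": np.repeat(np.arange(10), 2), "f1": np.arange(20.0)})
X = X.sample(frac=1.0, random_state=0)
y = pd.DataFrame({"date": X["date"], "y": np.arange(20.0)})

X_tr, X_te, y_tr, y_te = split_by_date(X, y)
print(sorted(X_tr["date"].unique()), sorted(X_te["date"].unique()))
```

Note that the split date itself goes to the training set, so the share of dates in the test set ends up slightly below `test_size`.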
The data is scaled using a standard scaler that is fitted on the training split only and then applied to both splits.
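from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# iloc[:, 1:] skips the first column, which holds the date in the frame built above
scaler.fit(X_train_NN.iloc[:,1:])
X_train_NN_scaled = scaler.transform(X_train_NN.iloc[:,1:])
X_test_NN_scaled = scaler.transform(X_test_NN.iloc[:,1:])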
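Fitting the scaler on the training split only, as done here, keeps test-set statistics out of the training features. A small check on made-up arrays shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler().fit(train)   # mean and std come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the train mean and std

print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.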
A custom loss was created that penalizes the loss heavily if the actual value and the prediction are on different sides of zero, and adds a slightly larger penalty if the prediction overshoots the actual value in either direction.
import tensorflow as tf

def custom_loss(y_true, y_pred):
    mse = tf.keras.losses.MeanSquaredError()
    penalty = 10
    # penalize the loss heavily if the actual and the prediction are on different sides of zero
    loss = tf.where(
        condition=tf.logical_or(
            tf.logical_and(tf.greater(y_true, 0.0), tf.less(y_pred, 0.0)),
            tf.logical_and(tf.less(y_true, 0.0), tf.greater(y_pred, 0.0)),
        ),
        x=mse(y_true, y_pred) * penalty,
        y=mse(y_true, y_pred) * penalty / 4
    )
    # add slightly more penalty if prediction overshoots actual in any direction
    loss = tf.where(
        condition=tf.logical_or(
            tf.logical_and(tf.greater(y_true, 0.0), tf.greater(y_pred, y_true)),
            tf.logical_and(tf.less(y_true, 0.0), tf.less(y_pred, y_true)),
        ),
        x=loss * penalty / 5,
        y=loss * penalty / 10
    )
    return loss
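Working through the two `tf.where` calls, the batch MSE ends up multiplied by 10 when the signs disagree, by 5 when the signs agree but the prediction overshoots, and by 2.5 otherwise. The NumPy sketch below reproduces only those multipliers so the arithmetic can be checked without TensorFlow; the helper name `penalty_multiplier` and the sample values are hypothetical.

```python
import numpy as np

def penalty_multiplier(y_true, y_pred, penalty=10.0):
    """Per-sample factor applied to the batch MSE by the two tf.where steps."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Step 1: wrong side of zero -> penalty, otherwise penalty / 4
    wrong_sign = ((y_true > 0) & (y_pred < 0)) | ((y_true < 0) & (y_pred > 0))
    mult = np.where(wrong_sign, penalty, penalty / 4)
    # Step 2: overshoot -> times penalty / 5, otherwise times penalty / 10
    overshoot = ((y_true > 0) & (y_pred > y_true)) | ((y_true < 0) & (y_pred < y_true))
    return np.where(overshoot, mult * penalty / 5, mult * penalty / 10)

y_true = [0.5, 0.5, 0.5, -0.5]
y_pred = [-0.2, 0.8, 0.3, -0.9]  # wrong sign, overshoot, undershoot, overshoot
multipliers = penalty_multiplier(y_true, y_pred)
print(multipliers)
```

A wrong-sign prediction can never also count as an overshoot under these conditions, so the three cases are mutually exclusive.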
Next, a neural-network model was created with two dense layers: a hidden layer of 32 ReLU units and a linear output layer. The model was trained for up to 50 epochs using a batch size of 64 with the custom loss, and then saved. Early stopping was used to halt training if the validation loss did not improve for 3 consecutive epochs.
from tensorflow.keras.callbacks import EarlyStopping

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear')
])
# Stop training when val_loss has not improved for 3 consecutive epochs
es = EarlyStopping(monitor='val_loss', patience=3)
model.compile(loss=custom_loss,
              # `learning_rate` replaces the deprecated `lr` argument
              optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.1),
              metrics=['mae'])
history = model.fit(X_train_NN_scaled, y_train_NN, epochs=50, batch_size=64,
                    validation_data=(X_test_NN_scaled, y_test_NN), callbacks=[es])
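The patience rule can be pictured with a simplified pure-Python sketch. This is not Keras's actual implementation, only the counting logic behind `patience=3`; the function name and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Best loss is reached at epoch 1, followed by three epochs without improvement
stopped_at = early_stop_epoch([1.0, 0.8, 0.9, 0.85, 0.81, 0.7])
print(stopped_at)
```

In this toy run the later, better loss of 0.7 is never reached because training has already stopped. In Keras, `EarlyStopping` also keeps the weights from the last epoch unless `restore_best_weights=True` is passed.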
During inference, the same engineered features are first created for the inference dataset. Columns are then filtered using the list of important columns determined during the feature selection stage. The inference dataset is scaled using the scaler fitted during training. Finally, the trained model predicts the performance rank for the inference dataset.
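The four steps can be sketched as a single function. Everything here is an illustrative assumption: the name `predict_new`, the toy columns, and `SumModel`, a stand-in that replaces the trained network so the sketch runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def predict_new(X_new, important_features, scaler, model):
    """Apply the training-time steps, in order, to unseen data."""
    feats = X_new.drop(columns=["id"])
    # 1. Same engineered feature as in training: row-wise mean, then fill NaN
    feats["mean"] = feats.drop(columns=["date"]).mean(axis=1)
    feats = feats.fillna(0)
    # 2. Keep only the columns chosen during feature selection
    selected = feats[important_features]
    # 3. Scale with the scaler fitted on the training data (no refit)
    scaled = scaler.transform(selected)
    # 4. Predict; only the ordering of the scores matters for Spearman
    preds = np.asarray(model.predict(scaled)).ravel()
    return X_new[["date", "id"]].assign(y_pred=preds)

class SumModel:
    """Hypothetical stand-in for the trained network."""
    def predict(self, X):
        return X.sum(axis=1)

train = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "mean": [1.5, 2.5, 3.5]})
scaler = StandardScaler().fit(train)
X_new = pd.DataFrame({"date": [9, 9], "id": ["a", "b"],
                      "f1": [2.0, np.nan], "f2": [4.0, 6.0]})
out = predict_new(X_new, ["f1", "mean"], scaler, SumModel())
print(out)
```

Keeping the column order identical to the one used when the scaler was fitted matters, since the scaler applies its stored mean and standard deviation column by column.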
Lastly, the Spearman correlation between the predicted rank and the actual rank was calculated. As shown in the image below, a Spearman correlation score of 3.2 was obtained (presumably expressed in percent, since the coefficient itself lies between -1 and 1).
That brings me to the end of this article. I hope it provides some interesting techniques that might help your analysis. If you would like to view my full code, you can visit my repository on GitHub. Thanks!