1 E6893 Big Data Analytics: Financial Market Volatility
Final Project Presentation
Jimmy Zhong, Tim Wu, Oliver Zhou, John Terzis
December 22, 2014
© 2014 CY Lin, Columbia University

2 Feature Selection/Extraction Using Hadoop
The MapReduce programming model is used to generate a feature matrix from raw price data across hundreds of symbols. Raw price data is first merged on timestamp with a fixed set of user-determined features. Feature extraction is done in the reducer by creating forward- and backward-looking volatility values for each timestamp of each symbol (see the reducer sketch below). The resulting feature matrix contains over 300 columns, from a starting point of 12. The feature matrix can be further transformed with a script that performs time-series clustering on intra-day price activity.
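A minimal sketch of the reducer stage, assuming a Hadoop Streaming job whose mapper emits tab-separated symbol/timestamp/price records that Hadoop sorts by symbol and timestamp; the window size and field layout are illustrative assumptions, not taken from the project code.

```python
#!/usr/bin/env python
"""Hadoop Streaming reducer: per-symbol forward/backward-looking volatility.

Assumes the mapper emits lines of the form  symbol \t timestamp \t price
and that Hadoop has sorted them by symbol (key) and timestamp.
"""
import sys
import math

WINDOW = 30  # look-back / look-ahead window in rows (illustrative)

def volatility(prices):
    """Standard deviation of log returns over a list of prices."""
    if len(prices) < 2:
        return 0.0
    rets = [math.log(b / a) for a, b in zip(prices, prices[1:]) if a > 0]
    if not rets:
        return 0.0
    mean = sum(rets) / len(rets)
    return math.sqrt(sum((r - mean) ** 2 for r in rets) / len(rets))

def flush(symbol, rows):
    """Emit backward- and forward-looking volatility for each timestamp."""
    prices = [p for _, p in rows]
    for i, (ts, _) in enumerate(rows):
        back = volatility(prices[max(0, i - WINDOW):i + 1])
        fwd = volatility(prices[i:i + WINDOW + 1])
        print("%s\t%s\t%.6f\t%.6f" % (symbol, ts, back, fwd))

current, rows = None, []
for line in sys.stdin:
    symbol, ts, price = line.rstrip("\n").split("\t")
    if symbol != current and current is not None:
        flush(current, rows)   # symbol boundary: emit buffered rows
        rows = []
    current = symbol
    rows.append((ts, float(price)))
if current is not None:
    flush(current, rows)
```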

3 Supervised Learning on Spark Using MLLIB
Spark was installed, and pyspark was used to perform cross-validated ridge regression with stochastic gradient descent, with the goal of producing a regressor that can predict volatility over some forward-looking interval (60 minutes, 1 day, 10 days, etc.) for a given symbol. A combination of MLLIB and scikit-learn was used, since MLLIB did not yet have Python bindings for cross-validated splitting of the dataset. Spark was run on data held in HDFS. Results were tested on a hold-out sample, and R² was calculated to show how much variance the regressor could explain; a sketch of the training step follows.
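A minimal pyspark sketch using the Spark 1.x RDD API (RidgeRegressionWithSGD) that was current when this project was done. The HDFS path, column layout, hyperparameters, and 80/20 split are illustrative assumptions.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, RidgeRegressionWithSGD

sc = SparkContext(appName="VolatilityRidge")

# Illustrative HDFS path and layout: label (forward-looking volatility)
# first, features after it, comma-separated.
def parse(line):
    vals = [float(x) for x in line.split(",")]
    return LabeledPoint(vals[0], vals[1:])

data = sc.textFile("hdfs:///user/e6893/feature_matrix.csv").map(parse)

# Simple hold-out split; the project used scikit-learn for k-fold
# splitting since MLlib's Python API lacked it at the time.
train, holdout = data.randomSplit([0.8, 0.2], seed=42)

model = RidgeRegressionWithSGD.train(
    train, iterations=100, step=0.01, regParam=0.1)

# R^2 on the hold-out sample: 1 - SS_res / SS_tot
preds = holdout.map(lambda p: (p.label, model.predict(p.features))).cache()
mean_y = preds.map(lambda lp: lp[0]).mean()
ss_res = preds.map(lambda lp: (lp[0] - lp[1]) ** 2).sum()
ss_tot = preds.map(lambda lp: (lp[0] - mean_y) ** 2).sum()
print("holdout R^2 = %.4f" % (1.0 - ss_res / ss_tot))
```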

4 Time-Series Analysis: Forecasting Multiple Steps Ahead with a GARCH Model and Calculating VaR
Motivation: Real-world financial time series exhibit volatility clustering: periods of relative calm are interrupted by bursts of volatility. An extreme market movement can pose a significant downside risk to an investor's security portfolio. Using the RHadoop ecosystem to forecast future volatility and calculate Value at Risk (VaR) can help investors prepare for losses arising from natural or man-made catastrophes, even of a magnitude not experienced before.
Algorithm (a Python sketch of steps 2-4 follows this slide):
1. Used Pig and a Python script to pre-process the raw data (AAPL), then loaded it into RStudio.
2. Applied R code (TimeSeriesAnalysis.R) and calculated the return in percentage terms.
3. Applied GARCH modeling to forecast future volatility and calculate VaR.
4. Applied Extreme Value Theory (EVT) to fit a GPD distribution to the tails.
Result:
1. Calculated the volatility forecast and Value at Risk (VaR) at the 99% confidence level (the loss is expected to be exceeded only 1% of the time). In this example, AAPL (2008-2009), we calculated that with 99% probability the monthly loss will not exceed 4%.
2. Used statistical hypothesis tests (Ljung-Box) for autocorrelation in squared returns (p-value ≈ 0; reject the null hypothesis of no autocorrelation in the squared returns at the 1% significance level). A GARCH model should therefore be employed in modeling the return time series.
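The modeling on this slide was done in R (TimeSeriesAnalysis.R); the sketch below walks through the analogous steps in Python using the arch and statsmodels packages. The file name, the GARCH(1,1) order, the 10-step horizon, and the normal-distribution VaR formula are illustrative assumptions, not taken from the project code.

```python
import numpy as np
import pandas as pd
from arch import arch_model
from scipy.stats import norm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Illustrative: daily AAPL closes for 2008-2009 from a local CSV.
prices = pd.read_csv("aapl_2008_2009.csv",
                     index_col=0, parse_dates=True)["close"]
returns = 100 * prices.pct_change().dropna()  # returns in percent

# Ljung-Box test on squared returns: a near-zero p-value rejects
# "no autocorrelation", i.e. volatility clustering is present and
# a GARCH model is warranted.
print(acorr_ljungbox(returns ** 2, lags=[10]))

# Fit GARCH(1,1) and forecast conditional variance 10 steps ahead.
model = arch_model(returns, vol="Garch", p=1, q=1, dist="normal")
res = model.fit(disp="off")
fcast = res.forecast(horizon=10)
sigma = np.sqrt(fcast.variance.iloc[-1])  # forecast std dev per horizon

# 99% VaR under a normal assumption: the loss exceeded only 1% of the time.
mu = returns.mean()
var_99 = -(mu + norm.ppf(0.01) * sigma)
print("99%% 1-step VaR: %.2f%%" % var_99.iloc[0])
```

The EVT step could be sketched similarly by fitting a generalized Pareto distribution (e.g., scipy.stats.genpareto.fit) to the exceedances over a high threshold in the loss tail.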

5 Time-Series Analysis: Forecasting Multiple Steps Ahead with a GARCH Model and Calculating VaR

7 Tail of the AAPL % return data; quantile-quantile plot

8 K-Means Clustering
The goal is to relate different time intervals to stock volatility through clustering (see the sketch after this slide).
Symbols: AIG, AMZN, PEP
Vector dimensions: normalized volume, symbol volatility +1 day, VIX volatility +1 day, time interval
Time intervals: period of day, day of week, fiscal quarter, year
K-means clustering was run in R and Hadoop with a cluster size of 3-4. The Euclidean distance measure was used, since all features are real-valued.
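The project ran this clustering in R and Hadoop; below is a compact scikit-learn equivalent over the four vector dimensions on the slide. The CSV export and its column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed column names for the four vector dimensions on the slide.
cols = ["norm_volume", "sym_vol_1d", "vix_vol_1d", "time_interval"]
df = pd.read_csv("feature_matrix.csv")  # assumed export of the feature matrix

# Standardize so the Euclidean distance treats all features comparably.
X = StandardScaler().fit_transform(df[cols])

# Cluster sizes of 3 and 4, as on the slide; k-means minimizes
# within-cluster Euclidean distance.
for k in (3, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    df["cluster_%d" % k] = km.labels_
    print("k=%d centers (scaled units):" % k)
    print(km.cluster_centers_)
```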

9 Cluster Results
No strong correlation between time intervals and symbol volatility across all three sectors. No strong correlation between VIX volatility and symbol volatility. There is a significant relationship between volume and symbol volatility.

10 Logistic Regression
The goal is to use a classification model to separate variables during feature selection and identify which ones provide the best predictive power (a scikit-learn sketch follows).
Stock symbols tested: AIG, AMZN, PEP
Parameters in the dataset: normalized volume, symbol volatility +1 day, VIX volatility +1 day, time interval
We targeted predicting when symbol volatility would rise above 0.25, which historically is a rough cutoff for regime changes from low- to high-volatility market cycles.
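A minimal sketch of the classification step with scikit-learn, scoring each candidate feature by AUC against the 0.25 threshold from the slide. The CSV export and column names are the same illustrative assumptions as in the clustering sketch.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("feature_matrix.csv")  # assumed feature-matrix export

# Binary target: does next-day symbol volatility cross the 0.25 cutoff?
y = (df["sym_vol_1d"] > 0.25).astype(int)

# Fit one single-feature model per candidate and compare AUC to judge
# each feature's predictive power in isolation.
for feature in ["norm_volume", "vix_vol_1d", "time_interval"]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[[feature]], y, test_size=0.3, random_state=0, stratify=y)
    clf = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print("%s: AUC = %.3f" % (feature, auc))
```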

11 Logistic Regression Results
Measured by AUC (area under the ROC curve): 1.0 is a perfect classifier, 0.0 is a perfectly inverted one, and 0.5 is no better than random.
Little to no relationship between time intervals and symbol volatility, though that may be skewed by market crashes. VIX volatility and symbol volatility appear nearly unrelated (AUC close to 0.5). There is a significant relationship between volume and symbol volatility.

12 Questions?

