Spurious Correlations: The Comedy and Drama of Statistics | by Celia Banks, Ph.D. | Feb, 2024


What are the research questions?

Why the heck do we need them?

We’re doing a “bad” analysis, right?

Research questions are the foundation of a research study. They guide the research process by focusing on the specific topics the researcher will investigate. Reasons why they are essential include, but are not limited to: providing focus and clarity; guiding the methodology; establishing the relevance of the study; helping to structure the report; and helping the researcher evaluate results and interpret findings. In learning how a ‘bad’ analysis is conducted, we addressed the following questions:

(1) Are the data sources valid (not made up)?

(2) How were missing values handled?

(3) How were you able to merge dissimilar datasets?

(4) What are the response and predictor variables?

(5) Is the relationship between the response and predictor variables linear?

(6) Is there a correlation between the response and predictor variables?

(7) Can we say that there is a causal relationship between the variables?

(8) What explanation would you provide a client interested in the relationship between these two variables?

(9) Did you find spurious correlations in the chosen datasets?

(10) What learning was your takeaway in conducting this project?

How did we conduct a study about spurious correlations?

To investigate the presence of spurious correlations between variables, a comprehensive analysis was conducted. The datasets spanned different economic and environmental domains and were collected from, and affirmed as, public sources. They contained variables with no apparent causal relationship that nonetheless exhibited statistical correlation. The primary dataset was Apple stock price data; the secondary was daily high temperatures in New York City. Both spanned January 2017 through December 2022.

Rigorous statistical techniques were used to analyze the data. A Pearson correlation coefficient was calculated to quantify the strength and direction of the linear relationship between the pair of variables. To complete this analysis, scatter plots of the 5-year daily high temperatures in New York City, candlestick charting of the 5-year Apple stock trend, and a dual-axis chart of the daily high temperatures versus the stock trend were used to visualize the relationship between the variables and to identify patterns or trends (a minimal charting sketch follows the dataset list below). The data sources this methodology drew on were:

Primary dataset: Apple Stock Price History | Historical AAPL Company Stock Prices | FinancialContent Business Page

Secondary dataset: New York City daily high temperatures from Jan 2017 to Dec 2022: https://www.extremeweatherwatch.com/cities/new-york/year-{year}
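As a rough illustration of the dual-axis view, the sketch below plots a temperature-like series against a stock-like series on two y-axes. The data here is synthetic placeholder data and the variable names are illustrative; in the study the two series are the scraped NYC daily highs and the AAPL daily closes.

```python
# Synthetic placeholder series standing in for the scraped NYC daily highs
# and the AAPL daily closes; only the dual-axis charting pattern is the point.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range("2017-01-01", "2022-12-31", freq="D")
rng = np.random.default_rng(0)
high_temp = 55 + 25 * np.sin(2 * np.pi * dates.dayofyear / 365.25) + rng.normal(0, 5, len(dates))
close = 30 + np.cumsum(rng.normal(0.05, 1.5, len(dates)))  # random-walk stand-in for AAPL

fig, ax_temp = plt.subplots(figsize=(12, 5))
ax_temp.plot(dates, high_temp, color="tab:orange", alpha=0.6)
ax_temp.set_ylabel("NYC daily high (°F)")

ax_stock = ax_temp.twinx()           # second y-axis sharing the same x-axis
ax_stock.plot(dates, close, color="tab:blue")
ax_stock.set_ylabel("AAPL close (USD)")

ax_temp.set_title("Daily high temperature vs. AAPL closing price (dual axis)")
fig.tight_layout()
plt.show()
```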

The data was affirmed as publicly sourced and available for reproducibility. Capturing the data over a five-year period gave a meaningful view of patterns, trends, and linearity. The temperature readings showed seasonal trends, and both the temperature and stock series had troughs and peaks in their data points. Note that temperatures were in Fahrenheit and initially grouped by meteorological seasons; we later regrouped them by astronomical seasons to further manipulate the data and pose stronger spuriousness. While the data could be downloaded as CSV or XLS files, for this assignment Python’s Beautiful Soup web scraping library was used.
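For readers who want to reproduce the scraping step, here is a hedged sketch. The table layout and column positions are assumptions about the extremeweatherwatch.com pages rather than their confirmed structure, so inspect the HTML and adjust the selectors before relying on it.

```python
# Hedged sketch of the scraping step; the table structure below is assumed,
# not confirmed, for https://www.extremeweatherwatch.com/cities/new-york/year-{year}.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_nyc_daily_temps(year: int) -> pd.DataFrame:
    url = f"https://www.extremeweatherwatch.com/cities/new-york/year-{year}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    rows = []
    # Assumption: daily readings live in a <table> with rows of the form
    # <td>date</td><td>high</td><td>low</td>; skip the header row.
    for tr in soup.find("table").find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 3:
            rows.append({"date": f"{cells[0]} {year}", "high": cells[1], "low": cells[2]})
    return pd.DataFrame(rows)

# Stack the years 2017-2022 into a single frame.
weather = pd.concat(
    [scrape_nyc_daily_temps(y) for y in range(2017, 2023)], ignore_index=True
)
```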

Next, the data was checked for missing values and the number of records each dataset contained. The weather data contained date, daily high, and daily low temperature; the Apple stock data contained date, opening price, closing price, volume, stock price, and stock name. To merge the datasets, the date columns needed to be in datetime format. An inner join kept matching records and discarded non-matching ones. For the Apple stock data, date and daily closing price were the columns of interest; for the weather data, date and daily high temperature.
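A minimal sketch of that preparation, assuming a weather DataFrame like the one from the scraping sketch and a stock DataFrame with an illustrative close column (not necessarily the exact schema used in the study):

```python
import pandas as pd

# How many records does each dataset contain, and are any values missing?
print(len(weather), weather.isna().sum(), sep="\n")
print(len(stock), stock.isna().sum(), sep="\n")

# Both date columns must be datetime before they can be matched,
# and the scraped temperatures arrive as strings.
weather["date"] = pd.to_datetime(weather["date"])
weather["high"] = pd.to_numeric(weather["high"], errors="coerce")
stock["date"] = pd.to_datetime(stock["date"])

# Keep only the columns of interest and sort both frames ascending by date.
weather = weather[["date", "high"]].sort_values("date")
stock = stock[["date", "close"]].sort_values("date")

# An inner join keeps dates present in both datasets (trading days) and drops the rest.
merged = stock.merge(weather, on="date", how="inner")
```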

From Duarte® Slide Deck

To do ‘bad’ the right way, you have to massage the data until you find the relationship that you’re looking for…

Our earlier approach did not quite yield the intended results. So, instead of using summer 2018 temperatures in five U.S. cities, we pulled five years of daily high temperatures for New York City and Apple stock performance from January 2017 through December 2022. In conducting exploratory analysis, we saw weak correlations across the seasons and years. So, our next step was to regroup the temperatures by season: instead of meteorological seasons, we chose astronomical. This gave us ‘meaningful’ correlations across seasons.

With the new approach in place, we noticed that merging the datasets was problematic. The date fields differed: for the weather data, the date was month and day; for the stock data, the date was in year-month-day format. We addressed this by converting each dataset’s date column to datetime. In addition, one date column was sorted in chronological order and the other in reverse chronological order; this was resolved by sorting both date columns in ascending order.

The spurious nature of the correlations here is shown by shifting from meteorological seasons (Spring: Mar-May, Summer: Jun-Aug, Fall: Sep-Nov, Winter: Dec-Feb), which are based on weather patterns in the northern hemisphere, to astronomical seasons (Spring: Apr-Jun, Summer: Jul-Sep, Fall: Oct-Dec, Winter: Jan-Mar), which are based on Earth’s tilt.
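A small helper reflecting those two groupings; this is an illustrative sketch rather than the study’s own code, with season labels following the definitions quoted above.

```python
# Map a calendar month to a season label under either grouping.
def season(month: int, scheme: str = "astronomical") -> str:
    if scheme == "meteorological":
        groups = {"Winter": (12, 1, 2), "Spring": (3, 4, 5),
                  "Summer": (6, 7, 8), "Fall": (9, 10, 11)}
    else:  # the "astronomical" grouping used in this study
        groups = {"Winter": (1, 2, 3), "Spring": (4, 5, 6),
                  "Summer": (7, 8, 9), "Fall": (10, 11, 12)}
    for name, months in groups.items():
        if month in months:
            return name
    raise ValueError(f"invalid month: {month}")
```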

Once we accomplished the exploration, a key point in our analysis of spurious correlation was to determine whether the variables of interest correlate. We found that Spring 2020 had a correlation of 0.81. We then determined whether there was statistical significance: yes, and at p-value ≈ 0.000000000000001066818316115281, I’d say we have significance!

Spring 2020 temperatures correlate with Apple stock
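A sketch of that check with SciPy, reusing the merged DataFrame and the season helper from the earlier sketches (the column names are illustrative):

```python
from scipy.stats import pearsonr

# Tag rows with their (astronomical) season using the helper sketched earlier.
merged["season"] = merged["date"].dt.month.map(season)

spring_2020 = merged[(merged["season"] == "Spring") & (merged["date"].dt.year == 2020)]
r, p_value = pearsonr(spring_2020["high"], spring_2020["close"])
print(f"Spring 2020: r = {r:.2f}, p = {p_value:.2e}")  # the study reports r ≈ 0.81
```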

If there is truly spurious correlation, we may want to consider whether the correlation equates to causation: that is, does a change in temperature (grouped by astronomical season) cause Apple stock to fluctuate? We employed further statistical testing to test the hypothesis that one variable causes the other.

There are numerous statistical tools that test for causality, such as Instrumental Variable (IV) Analysis, Panel Data Analysis, Structural Equation Modelling (SEM), Vector Autoregression (VAR) models, Cointegration Analysis, and Granger causality. IV analysis accounts for omitted variables in regression analysis; Panel Data Analysis uses fixed-effects and random-effects models; SEM analyzes structural relationships; Vector Autoregression models dynamic multivariate time-series interactions; and Cointegration Analysis determines whether variables move together along a shared stochastic trend. We wanted a tool that could finely distinguish between genuine causality and coincidental association. To achieve this, our choice was Granger causality.

Granger Causality

A Granger causality test checks whether past values of one time series help predict future values of another, beyond what that series’ own past already predicts. In our case, we tested whether past daily high temperatures in New York City could predict future values of Apple stock prices.

H₀: Daily high temperatures in New York City do not Granger-cause Apple stock price fluctuations.

To conduct the test, we ran through 100 lags to see whether there was a standout p-value. We encountered p-values near 1.0, which suggested that we could not reject the null hypothesis, and we concluded that there was no evidence of a causal relationship between the variables of interest.

Granger Causality Test at lags=100
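A sketch of the test with statsmodels, again assuming the merged DataFrame from the earlier sketches; reading the ssr F-test p-value at each lag is one reasonable choice, not necessarily the study’s exact procedure.

```python
# grangercausalitytests expects a two-column array and tests whether the
# second column (temperature) Granger-causes the first (stock price).
from statsmodels.tsa.stattools import grangercausalitytests

data = merged[["close", "high"]].dropna()
results = grangercausalitytests(data, maxlag=100, verbose=False)

# Pull the ssr F-test p-value at each lag; values near 1.0 mean we cannot
# reject H0 (temperature does not Granger-cause the stock price).
p_values = {lag: res[0]["ssr_ftest"][1] for lag, res in results.items()}
print(min(p_values.values()), max(p_values.values()))
```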

The Granger causality test found the p-values insignificant, so we could not reject the null hypothesis. But is that enough? Let’s validate our analysis.

To help mitigate the risk of misinterpreting spuriousness as a genuine causal effect, a cross-correlation analysis can be performed in conjunction with the Granger causality test to confirm its finding. Using this approach, if spurious correlation exists, we will observe significant cross-correlation at some lags without a consistent causal direction, or without Granger causality being present.

Cross-Correlation Analysis

This method is accomplished by the following steps (a code sketch follows the list):

  • Examine temporal patterns of correlations between the variables;
  • If variable A Granger-causes variable B, significant cross-correlation will occur between variable A and variable B at positive lags;
  • Significant peaks in cross-correlation at specific lags indicate the time delay between changes in the causal variable and the response in the other variable.
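A sketch of that check using statsmodels’ ccf, once more assuming the merged DataFrame and illustrative column names from earlier; the ±1.96/√n band is the usual white-noise approximation, not the study’s exact method.

```python
# Compare lead-lag correlations between the two series against a rough
# white-noise significance band of ±1.96/√n.
import numpy as np
from statsmodels.tsa.stattools import ccf

temp = merged["high"].to_numpy(dtype=float)
price = merged["close"].to_numpy(dtype=float)

cross_corr = ccf(temp, price, adjusted=False)[:30]   # first 30 lags
band = 1.96 / np.sqrt(len(temp))

# Lags whose correlation pokes outside the band; a consistent causal direction
# would show up as a run of such peaks, which we do not expect here.
significant_lags = [(lag, round(c, 3)) for lag, c in enumerate(cross_corr) if abs(c) > band]
print(significant_lags)
```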

Interpretation:

The ccf and lag values show significant positive correlation at certain lags. This confirms that spurious correlation exists. However, like the Granger causality test, the cross-correlation analysis cannot support the claim that causality exists in the relationship between the two variables.

  • Spurious correlations are a form of p-hacking. Correlation does not imply causation.
  • Even with ‘bad’ data tactics, statistical testing will root out the lack of significance. While there was statistical evidence of spuriousness in the variables, causality testing could not support the claim that causality existed in the relationship of the variables.
  • A study cannot rest on the sole premise that correlated variables displaying linearity also exhibit causality. Instead, other factors that contribute to each variable must be considered.
  • A non-statistical test of whether daily high temperatures in New York City cause Apple stock to fluctuate can be to just consider: If you owned an Apple stock certificate and you placed it in the freezer, would the value of the certificate be impacted by the cold? Similarly, if you placed the certificate outside on a sunny, hot day, would the sun impact the value of the certificate?
Image by storyset on Freepik

Spurious correlations are not causality. P-hacking may impact your credibility as a data scientist. Be the adult in the room and refuse to participate in bad statistics.

This study presented an analysis that involved ‘bad’ statistics. It demonstrated how a data scientist could source, extract, and manipulate data in such a way as to statistically show correlation. In the end, statistical testing withstood the challenge and demonstrated that correlation does not equal causality.

Conducting a spurious correlation study raises ethical questions about using statistics to derive causation between two unrelated variables. It is an example of p-hacking, which exploits statistics in order to achieve a desired outcome. This study was done as academic research to show the absurdity of misusing statistics.

Another area of ethical consideration is the practice of web scraping. Many website owners warn against pulling data from their sites to use in nefarious ways or in ways unintended by them. For this reason, sites like Yahoo Finance make stock data downloadable as CSV files. The same is true for most weather sites, where you can request time-series datasets of temperature readings. Again, this study is for academic research and to demonstrate one’s ability to extract data in a nonconventional way.

When faced with a boss or client who compels you to p-hack and offer something like a spurious correlation as proof of causality, explain the implications of their ask and respectfully refuse the project. Whatever your decision, it will have a lasting impact on your credibility as a data scientist.

Dr. Banks is CEO of I-Meta, maker of the patented Spice Chip Technology that provides Big Data analytics for various industries. Mr. Boothroyd, III is a retired military analyst. Both are veterans who honorably served in the United States military, and both enjoy discussing spurious correlations. They are part of the same cohort in the University of Michigan School of Information MADS program…Go Blue!


