Generating Alpha using Starmine Analytical Models - a Python example

This article will demonstrate how we can use Starmine Analytical Models to derive insight and generate alpha, in this case, for equity markets. It is intended as a teaser to intice you to further explore this rich vein rather than a rigorous scientific study. I will be using Eikon and more specifically our new Eikon Data API to access all the data I need to conduct this analysis - however Starmine Analytics is available in other products and delivery channels. I will also provide the Jupyter notebook source via gitub.

Pre-requisites:

Thomson Reuters Eikon with access to new Eikon Data APIs

Python 2.x/3.x

Required Python Packages: eikon, pandas, numpy, matplotlib, sklearn, scipy

Starmine Quantitative Analytics

In short - StarMine provides a suite of proprietary alpha-generating analytics and models spanning sectors, regions, and markets. The list of models is really broad-based and includes both quantitative analytics (such as smartEstimates) and quantitative models (such as the Combined Alpha Model I will demo here).

All of these analytics and models can provide you with new sources of information and subsequently alpha - they are calculated by our Starmine team and delivered to you - saving you many man years of work and research etc. You can find out more about Starmine Analytics by looking at the PDF attached to this article and you can find out more about the Combined Alpha Model by viewing the short 2 min video below.

In [1]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/wNVJEG9Utlo" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen></iframe>')
 

How to Access the Starmine and Other Data

As mentioned above I will be using Eikon and specifically our new Eikon Data API - which is a really easy to use yet performant web API. At the time of writing this article it is still in Beta - but all Eikon users can access it by simply upgrading their Eikon to v4.0.36+ and going to the developer portal here and follow instructions. If you have any questions around the API there are also monitored Q&A forums here.

Once we have access to the Eikon Data API its very straightforward to get the data we need. First lets import some packages we will need to conduct this analysis and also set our App ID (this is available from the App ID generator - see the quick start guide for further details):

In [2]:
import eikon as ek
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ek.set_app_key('YOUR APP KEY HERE')
 

Next we need to formulate our API call. In this case, I will be requesting data for all CAC-40 index constituents and I will be requesting a mixture of both Reference, Starmine Analytics and Price performance data for 8 quarters - I have made it more generic here - but essentially this is one line of code! And as you will see the API returns the data in a pandas dataframe.

In [3]:
RICS = ['0#.FCHI']
fields =['TR.TRBCIndustryGroup','TR.CombinedAlphaCountryRank(SDate=0,EDate=-7,Frq=FQ)','TR.CombinedAlphaCountryRank(SDate=0,EDate=-7,Frq=FQ).Date',
         'TR.TotalReturn3Mo(SDate=0,EDate=-7,Frq=FQ)','TR.TotalReturn3Mo(SDate=0,EDate=-7,Frq=FQ).calcdate']

ids,err=ek.get_data(RICS,fields=fields)
ids.head(20)
Out[3]:
  Instrument TRBC Industry Group Name Combined Alpha Model Country Rank Date 3 Month Total Return Calc Date
0 ACCP.PA Hotels & Entertainment Services 13.0 2017-12-31T00:00:00Z 2.295706 2017-12-31
1 ACCP.PA   5.0 2017-09-30T00:00:00Z 2.411987 2017-09-30
2 ACCP.PA   36.0 2017-06-30T00:00:00Z 7.890774 2017-06-30
3 ACCP.PA   14.0 2017-03-31T00:00:00Z 10.217330 2017-03-31
4 ACCP.PA   17.0 2016-12-31T00:00:00Z 0.339847 2016-12-31
5 ACCP.PA   5.0 2016-09-30T00:00:00Z 1.787259 2016-09-30
6 ACCP.PA   13.0 2016-06-30T00:00:00Z -5.187512 2016-06-30
7 ACCP.PA   9.0 2016-03-31T00:00:00Z -6.911636 2016-03-31
8 AIRP.PA Chemicals 44.0 2017-12-31T00:00:00Z 2.396977 2017-12-31
9 AIRP.PA   22.0 2017-09-30T00:00:00Z 4.297597 2017-09-30
10 AIRP.PA   20.0 2017-06-30T00:00:00Z 3.664822 2017-06-30
11 AIRP.PA   17.0 2017-03-31T00:00:00Z 1.372456 2017-03-31
12 AIRP.PA   24.0 2016-12-31T00:00:00Z 10.686223 2016-12-31
13 AIRP.PA   10.0 2016-09-30T00:00:00Z 4.185135 2016-09-30
14 AIRP.PA   38.0 2016-06-30T00:00:00Z -4.252091 2016-06-30
15 AIRP.PA   26.0 2016-03-31T00:00:00Z -4.592378 2016-03-31
16 AIR.PA Aerospace & Defense 62.0 2017-12-31T00:00:00Z 3.220992 2017-12-31
17 AIR.PA   25.0 2017-09-30T00:00:00Z 11.680556 2017-09-30
18 AIR.PA   59.0 2017-06-30T00:00:00Z 2.453877 2017-06-30
19 AIR.PA   52.0 2017-03-31T00:00:00Z 13.510503 2017-03-31
 

Now that we have our data in a dataframe we may need to do some wrangling to get the types correctly set. The get_data call is the most flexible of calls so this is to be expected. We can easily check the types of data in the dataframe by column:

In [4]:
ids.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320 entries, 0 to 319
Data columns (total 6 columns):
Instrument                           320 non-null object
TRBC Industry Group Name             320 non-null object
Combined Alpha Model Country Rank    316 non-null float64
Date                                 320 non-null object
3 Month Total Return                 320 non-null float64
Calc Date                            320 non-null object
dtypes: float64(2), object(4)
memory usage: 10.0+ KB
 

First we want to cast the Date object (a string) as a datetime type, then we want to set that datetime field as the index for the frame. Secondly, we want to make sure any numeric fields are numeric and any text fields are strings:

In [5]:
ids['Date']=pd.to_datetime(ids['Date'])
ads=ids.set_index('Date')[['Instrument','TRBC Industry Group Name','Combined Alpha Model Country Rank','3 Month Total Return']]
ads['3 Month Total Return'] = pd.to_numeric(ads['3 Month Total Return'], errors='coerse')
ads['TRBC Industry Group Name'] = ads['TRBC Industry Group Name'].astype(str)
ads.head(10) 
Out[5]:
  Instrument TRBC Industry Group Name Combined Alpha Model Country Rank 3 Month Total Return
Date        
2017-12-31 ACCP.PA Hotels & Entertainment Services 13.0 2.295706
2017-09-30 ACCP.PA   5.0 2.411987
2017-06-30 ACCP.PA   36.0 7.890774
2017-03-31 ACCP.PA   14.0 10.217330
2016-12-31 ACCP.PA   17.0 0.339847
2016-09-30 ACCP.PA   5.0 1.787259
2016-06-30 ACCP.PA   13.0 -5.187512
2016-03-31 ACCP.PA   9.0 -6.911636
2017-12-31 AIRP.PA Chemicals 44.0 2.396977
2017-09-30 AIRP.PA   22.0 4.297597
In [6]:
ads.dtypes
Out[6]:
Instrument                            object
TRBC Industry Group Name              object
Combined Alpha Model Country Rank    float64
3 Month Total Return                 float64
dtype: object
In [7]:
ads.head()
Out[7]:
  Instrument TRBC Industry Group Name Combined Alpha Model Country Rank 3 Month Total Return
Date        
2017-12-31 ACCP.PA Hotels & Entertainment Services 13.0 2.295706
2017-09-30 ACCP.PA   5.0 2.411987
2017-06-30 ACCP.PA   36.0 7.890774
2017-03-31 ACCP.PA   14.0 10.217330
2016-12-31 ACCP.PA   17.0 0.339847
 

A bit more wrangling is required here as we can see for example that the API has returned the TRBC Industry Group only for the most current instance of each instrument - not for the historical quarters. We can recitfy this easily in 2 lines of code. First we will replace blanks with nan (not a number). We can then use the excellent and surgical fillna dataframe function to fill the Sector name to the historic quarters:

In [8]:
ads1 = ads.replace('', np.nan, regex=True)
In [9]:
ads1['TRBC Industry Group Name'].fillna(method='ffill',limit=7, inplace=True)
ads1.head(15)
Out[9]:
  Instrument TRBC Industry Group Name Combined Alpha Model Country Rank 3 Month Total Return
Date        
2017-12-31 ACCP.PA Hotels & Entertainment Services 13.0 2.295706
2017-09-30 ACCP.PA Hotels & Entertainment Services 5.0 2.411987
2017-06-30 ACCP.PA Hotels & Entertainment Services 36.0 7.890774
2017-03-31 ACCP.PA Hotels & Entertainment Services 14.0 10.217330
2016-12-31 ACCP.PA Hotels & Entertainment Services 17.0 0.339847
2016-09-30 ACCP.PA Hotels & Entertainment Services 5.0 1.787259
2016-06-30 ACCP.PA Hotels & Entertainment Services 13.0 -5.187512
2016-03-31 ACCP.PA Hotels & Entertainment Services 9.0 -6.911636
2017-12-31 AIRP.PA Chemicals 44.0 2.396977
2017-09-30 AIRP.PA Chemicals 22.0 4.297597
2017-06-30 AIRP.PA Chemicals 20.0 3.664822
2017-03-31 AIRP.PA Chemicals 17.0 1.372456
2016-12-31 AIRP.PA Chemicals 24.0 10.686223
2016-09-30 AIRP.PA Chemicals 10.0 4.185135
2016-06-30 AIRP.PA Chemicals 38.0 -4.252091
 

We now need to shift the 3 Month Total Return column down by 1 so we can map CAM ranks to forward looking performance. Thankfully in pandas this is trivial and note the Groupby instrument clause.

In [10]:
ads1['3 Month Total Return'] = ads1.groupby('Instrument')['3 Month Total Return'].shift()
In [11]:
ads1.head(15)
Out[11]:
  Instrument TRBC Industry Group Name Combined Alpha Model Country Rank 3 Month Total Return
Date        
2017-12-31 ACCP.PA Hotels & Entertainment Services 13.0 NaN
2017-09-30 ACCP.PA Hotels & Entertainment Services 5.0 2.295706
2017-06-30 ACCP.PA Hotels & Entertainment Services 36.0 2.411987
2017-03-31 ACCP.PA Hotels & Entertainment Services 14.0 7.890774
2016-12-31 ACCP.PA Hotels & Entertainment Services 17.0 10.217330
2016-09-30 ACCP.PA Hotels & Entertainment Services 5.0 0.339847
2016-06-30 ACCP.PA Hotels & Entertainment Services 13.0 1.787259
2016-03-31 ACCP.PA Hotels & Entertainment Services 9.0 -5.187512
2017-12-31 AIRP.PA Chemicals 44.0 NaN
2017-09-30 AIRP.PA Chemicals 22.0 2.396977
2017-06-30 AIRP.PA Chemicals 20.0 4.297597
2017-03-31 AIRP.PA Chemicals 17.0 3.664822
2016-12-31 AIRP.PA Chemicals 24.0 1.372456
2016-09-30 AIRP.PA Chemicals 10.0 10.686223
2016-06-30 AIRP.PA Chemicals 38.0 4.185135
In [12]:
ads1.info()
 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 320 entries, 2017-12-31 to 2016-03-31
Data columns (total 4 columns):
Instrument                           320 non-null object
TRBC Industry Group Name             320 non-null object
Combined Alpha Model Country Rank    316 non-null float64
3 Month Total Return                 280 non-null float64
dtypes: float64(2), object(2)
memory usage: 10.0+ KB
In [13]:
ads1.dropna(axis=0, how='any', inplace=True)
ads1.head(20)
Out[13]:
  Instrument TRBC Industry Group Name Combined Alpha Model Country Rank 3 Month Total Return
Date        
2017-09-30 ACCP.PA Hotels & Entertainment Services 5.0 2.295706
2017-06-30 ACCP.PA Hotels & Entertainment Services 36.0 2.411987
2017-03-31 ACCP.PA Hotels & Entertainment Services 14.0 7.890774
2016-12-31 ACCP.PA Hotels & Entertainment Services 17.0 10.217330
2016-09-30 ACCP.PA Hotels & Entertainment Services 5.0 0.339847
2016-06-30 ACCP.PA Hotels & Entertainment Services 13.0 1.787259
2016-03-31 ACCP.PA Hotels & Entertainment Services 9.0 -5.187512
2017-09-30 AIRP.PA Chemicals 22.0 2.396977
2017-06-30 AIRP.PA Chemicals 20.0 4.297597
2017-03-31 AIRP.PA Chemicals 17.0 3.664822
2016-12-31 AIRP.PA Chemicals 24.0 1.372456
2016-09-30 AIRP.PA Chemicals 10.0 10.686223
2016-06-30 AIRP.PA Chemicals 38.0 4.185135
2016-03-31 AIRP.PA Chemicals 26.0 -4.252091
2017-09-30 AIR.PA Aerospace & Defense 25.0 3.220992
2017-06-30 AIR.PA Aerospace & Defense 59.0 11.680556
2017-03-31 AIR.PA Aerospace & Defense 52.0 2.453877
2016-12-31 AIR.PA Aerospace & Defense 14.0 13.510503
2016-09-30 AIR.PA Aerospace & Defense 13.0 16.716196
2016-06-30 AIR.PA Aerospace & Defense 22.0 4.078871
 

So now our Dataframe is seemingly in good shape - what are we going to do next? We can start with the question do higher CAM score leads to better performance? Or put more generically, are CAM scores RELATED to performance? Remember these CAM scores are dynamic and change overtime depending on changes in their underlying components AND relative to other companies in the universe (in this case companies in the same country).

So lets just see what this looks like in terms of a scatterplot for one sector of the CAC-40. Here we are just filtering the frame using a text filter and aggregating using a groupby statement. In this case we have 2 companies.

In [14]:
%matplotlib inline
adsl = ads1[ads1['TRBC Industry Group Name'] =='Industrial Conglomerates']
adsl = adsl.groupby('Instrument')
ax = adsl.plot(x='Combined Alpha Model Country Rank', y='3 Month Total Return', kind='scatter')
 

So lets be clear about what we are looking at. Each dot represents a pair of observations for CAM rank and 3M total return for a quarter - we should have 7 quarters of observations for each instrument. So we could implement a simple linear regression and that would have different parameters such as intercept, slope.

As I am in exploratory mode - I just am interested in the slope of the best fit linear line (of the form y = ax + b), where a is the slope coefficient. This should offer me some indication of how 3 Month total returns change for an increase in CAM rank. So our slope coefficient can tell us for example whether our variables are positively related (positive a) or negatively related (negative a) or really not related at all (near zero a).

We can also more formally calculate Spearman's Rank Correlation Coefficient which will give us a more rigorous measure of relatedness and also a probability that the null hypothesis (the two variables are not related) is true.

So rather than create plots for everything we can just create a model to provide the linear best fit solution for each instrument and then store that value in a new column as well as calculating the Spearman's Rank Correlation Coefficient and p-value.

First, I want to check for if any data is null as this may result in errors downstream.

In [15]:
ads1.isnull().any().count()
Out[15]:
4
 

Here we can see there are 4 null values in our dataset - we can deal with these by replacing them with either mean values or most recent values. In our case, I choose the latter and will use a forward fill function to replace nulls:

In [16]:
adsNN = ads1.fillna(method='ffill')
In [17]:
adsNN.isnull().any()
Out[17]:
Instrument                           False
TRBC Industry Group Name             False
Combined Alpha Model Country Rank    False
3 Month Total Return                 False
dtype: bool
 

Now we have confirmed we have no null values we can move on. Next we will use the Linear Regression model from the Linear Model tools from Scikit Learn package. We want to solve for 7 quarters of data for each Instrument. So we iterate over each intrument then use the model.fit method to generate the best fit linear solution (OLS) and then store the 'coef_' parameter of the model as a new column in the adsNN dataframe called 'slope'.

Whilst we are here we will also calculate a Spearman's Rank Correlation Coefficient using a routine from scipy package. The routine returns 2 values, the first is the Coefficient (Rho) and the second is the p-value. I just store these 2 elements in 2 seperate columns.

In [18]:
import sklearn
import scipy
from sklearn import linear_model
model = linear_model.LinearRegression()
for (group, adsNN_gp) in adsNN.groupby('Instrument'):
    X=adsNN_gp[['Combined Alpha Model Country Rank']]
    y=adsNN_gp[['3 Month Total Return']]
    model.fit(X,y)
    spearmans = scipy.stats.spearmanr(X,y)
    adsNN.loc[adsNN.Instrument == adsNN_gp.iloc[0].Instrument, 'slope'] = model.coef_
    adsNN.loc[adsNN.Instrument == adsNN_gp.iloc[0].Instrument, 'Rho'] = spearmans[0]
    adsNN.loc[adsNN.Instrument == adsNN_gp.iloc[0].Instrument, 'p'] = spearmans[1]

adsNN.head(15)
Out[18]:
  Instrument TRBC Industry Group Name Combined Alpha Model Country Rank 3 Month Total Return slope Rho p
Date              
2017-09-30 ACCP.PA Hotels & Entertainment Services 5.0 2.295706 0.119443 0.684712 0.089666
2017-06-30 ACCP.PA Hotels & Entertainment Services 36.0 2.411987 0.119443 0.684712 0.089666
2017-03-31 ACCP.PA Hotels & Entertainment Services 14.0 7.890774 0.119443 0.684712 0.089666
2016-12-31 ACCP.PA Hotels & Entertainment Services 17.0 10.217330 0.119443 0.684712 0.089666
2016-09-30 ACCP.PA Hotels & Entertainment Services 5.0 0.339847 0.119443 0.684712 0.089666
2016-06-30 ACCP.PA Hotels & Entertainment Services 13.0 1.787259 0.119443 0.684712 0.089666
2016-03-31 ACCP.PA Hotels & Entertainment Services 9.0 -5.187512 0.119443 0.684712 0.089666
2017-09-30 AIRP.PA Chemicals 22.0 2.396977 -0.250236 -0.571429 0.180202
2017-06-30 AIRP.PA Chemicals 20.0 4.297597 -0.250236 -0.571429 0.180202
2017-03-31 AIRP.PA Chemicals 17.0 3.664822 -0.250236 -0.571429 0.180202
2016-12-31 AIRP.PA Chemicals 24.0 1.372456 -0.250236 -0.571429 0.180202
2016-09-30 AIRP.PA Chemicals 10.0 10.686223 -0.250236 -0.571429 0.180202
2016-06-30 AIRP.PA Chemicals 38.0 4.185135 -0.250236 -0.571429 0.180202
2016-03-31 AIRP.PA Chemicals 26.0 -4.252091 -0.250236 -0.571429 0.180202
2017-09-30 AIR.PA Aerospace & Defense 25.0 3.220992 -0.234564 -0.607143 0.148231
 

Voila - I have all the calculations I requested and I think I just want to average these by Instrument so I can get a summary view. (note averaging the slope, Rho and p values does not change them as they were calculated once for the 7 periods and just copied 7 times).

In [19]:
Averages = adsNN.groupby(['Instrument']).mean()
Averages
Out[19]:
  Combined Alpha Model Country Rank 3 Month Total Return slope Rho p
Instrument          
ACCP.PA 14.142857 2.822199 0.119443 0.684712 0.089666
AIR.PA 33.571429 5.714654 -0.234564 -0.607143 0.148231
AIRP.PA 22.428571 3.193017 -0.250236 -0.571429 0.180202
ATOS.PA 94.142857 8.784536 -0.743983 -0.054056 0.908365
AXAF.PA 64.000000 4.571939 -0.196042 -0.126131 0.787572
BNPP.PA 60.142857 7.160401 -0.079672 -0.252262 0.585241
BOUY.PA 73.428571 4.506075 -0.323423 -0.500000 0.253170
CAGR.PA 55.285714 8.206785 -0.066242 0.142857 0.759945
CAPP.PA 59.142857 3.231260 -0.358011 -0.535714 0.215217
CARR.PA 26.285714 -2.853468 0.209579 -0.107143 0.819151
DANO.PA 29.428571 2.312771 0.070155 0.142857 0.759945
ENGIE.PA 46.428571 2.994981 -0.117695 -0.450469 0.310429
ESSI.PA 37.285714 1.446916 -0.205101 -0.392857 0.383317
EXHO.PA 55.428571 2.871624 -0.066887 -0.540562 0.210289
FTI.PA 15.666667 -4.249451 -0.995839 -1.000000 0.000000
LEGD.PA 45.142857 4.603834 -0.017527 -0.035714 0.939408
LHN.PA 35.857143 3.710595 -0.248977 -0.214286 0.644512
LVMH.PA 70.571429 8.080266 -0.167705 -0.250000 0.588724
MICP.PA 89.857143 5.207835 0.584094 0.630656 0.128888
MT.AS 70.857143 13.962780 0.107790 0.198206 0.670085
ORAN.PA 53.142857 0.186049 -0.193575 -0.846881 0.016197
OREP.PA 43.000000 2.721421 0.055467 0.642857 0.119392
PERP.PA 33.285714 4.797600 -0.066685 -0.071429 0.879048
PEUP.PA 84.428571 3.700341 -0.261221 -0.214286 0.644512
PRTP.PA 62.428571 14.708734 -0.008738 -0.018019 0.969415
PUBP.PA 62.142857 -0.332859 0.155443 0.321429 0.482072
RENA.PA 67.000000 0.840024 -0.286120 -0.054056 0.908365
SAF.PA 57.571429 5.371726 -0.176439 -0.357143 0.431611
SASY.PA 71.285714 1.715126 -0.241940 -0.285714 0.534509
SCHN.PA 53.142857 4.601030 -0.002400 0.214286 0.644512
SGEF.PA 70.285714 4.670457 -0.264057 -0.714286 0.071344
SGOB.PA 68.285714 3.861674 0.131461 0.000000 1.000000
SOGN.PA 55.857143 7.180684 0.029665 0.000000 1.000000
SOLB.BR 61.142857 4.775618 -0.236724 -0.250000 0.588724
STM.PA 73.285714 22.908329 -0.388359 -0.535714 0.215217
TOTF.PA 51.857143 3.422500 -0.031798 -0.185312 0.690778
UNBP.AS 32.000000 -0.891114 -0.095466 -0.071429 0.879048
VIE.PA 43.428571 1.959539 -0.331911 -0.828862 0.021174
VIV.PA 28.285714 4.241614 -0.079870 -0.250000 0.588724
VLOF.PA 88.142857 5.843416 0.022018 -0.306319 0.504027
 

Conclusion

So now we have a summarised view of our study - can we answer the question posited earlier? 'Are CAM scores RELATED to performance?' Let us have a look at average CAM score versus average 3 month total return (over 7 quarters). This seems to indicate that there is a positive relation between the two variables.

In [20]:
ax = Averages.plot(x='Combined Alpha Model Country Rank', y='3 Month Total Return', kind='scatter')
 
 

We can check this further by running the same Spearman's Rank Correlation Coefficient test against these summary results. Here we can see Rho of 50 which is both large & positive and with p of 0.101 the null hypothesis can be safetly rejected.

In [21]:
spearmans = scipy.stats.spearmanr(Averages['Combined Alpha Model Country Rank'],Averages['3 Month Total Return'])
Rho = spearmans[0]
p = spearmans[1]
print(Rho,p)
 
0.5003048928999382 0.001010764442996582