1. Introduction to Data Analysis with Pandas#

Pandas is a library for data analysis, manipulation, and visualization. The basic object defined by this module is the DataFrame. The dataset used in this notebook, on the classification of stars, can be obtained from Kaggle. We load the data from a CSV file into a Pandas DataFrame and demonstrate some basic functionality of the module.

You can think of a data frame as a table in which each row is a data entry and each column is a property of that entry. In our case, each entry is a star and the columns are some of its properties. We could also use each row to store the properties of some system at different time points, e.g. the concentration of different proteins in a cell over time. In that case each row would be a time point and each column a different protein.

As opposed to NumPy arrays, Pandas data frames let us work with the data by using labels (e.g. ‘temperature’) rather than having to remember the index numbers associated with the temperature data.

Another great aspect of Pandas data frames is that we can mix types of data, e.g. numerical variables like the mass of an object (e.g. 21.2 mg) and categorical data like a cell type (e.g. ‘cortical neuron’).
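As a minimal sketch of label-based access and mixed data types, the snippet below builds a small data frame by hand. The variable `cells`, the column names `mass_mg` and `cell_type`, and all the values are invented for illustration and are not part of the star dataset.

```python
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column
cells = pd.DataFrame({
    "mass_mg": [21.2, 18.7, 25.1],
    "cell_type": ["cortical neuron", "astrocyte", "cortical neuron"],
})

# Columns are selected by label, not by position
mean_mass = cells["mass_mg"].mean()

# Each column keeps its own data type
print(cells.dtypes)
print(f"Mean mass: {mean_mass:.2f} mg")
```

The numerical column supports arithmetic summaries such as the mean, while the text column sits next to it in the same table.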

In the field of data science, the columns of a dataset are often referred to as ‘feature vectors’. If you encounter that term, simply replace it in your mind with ‘column’.

1.1. Data description#

Each row represents a star.

Feature vectors:

  • Temperature – The surface temperature of the star

  • Luminosity – Relative luminosity: how bright it is

  • Size – Relative radius: how big it is

  • A_M – Absolute magnitude: another measure of the star’s luminosity

  • Color – General Color of Spectrum

  • Type – Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Super Giants, Hyper Giants

  • Spectral_Class – O,B,A,F,G,K,M / SMASS - Stellar classification

1.2. Loading data#

import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

The DataFrame can be created from a CSV file using the read_csv function. If you are working on Colab, you will need to upload the data first.

Notice that in this case we are loading data from a .csv file, but Pandas has readers for many other formats, such as Excel, JSON, and HDF5. MATLAB data files are not read by Pandas directly, but they can be loaded with SciPy (scipy.io.loadmat) and then converted to a DataFrame.

df = pd.read_csv('Stars.csv')

The head method displays the first few rows of data together with the column headers:

df.head()
   Temperature  Luminosity    Size    A_M Color Spectral_Class       Type
0         3068    0.002400  0.1700  16.12   Red              M  Red Dwarf
1         3042    0.000500  0.1542  16.60   Red              M  Red Dwarf
2         2600    0.000300  0.1020  18.70   Red              M  Red Dwarf
3         2800    0.000200  0.1600  16.65   Red              M  Red Dwarf
4         1939    0.000138  0.1030  20.06   Red              M  Red Dwarf

Specific columns of the DataFrame can be accessed by specifying the column name between square brackets:

stars_colors = df['Color'] # notice that column names are case sensitive, i.e. 'color' != 'Color'
print(stars_colors)
0        Red
1        Red
2        Red
3        Red
4        Red
       ...  
235     Blue
236     Blue
237    White
238    White
239     Blue
Name: Color, Length: 240, dtype: object

The individual entries of the DataFrame (i.e. rows) can be accessed by integer position using the iloc indexer:

print(df.iloc[[0]]) # where 0 is the index of the first entry
   Temperature  Luminosity  Size    A_M Color Spectral_Class       Type
0         3068      0.0024  0.17  16.12   Red              M  Red Dwarf
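The difference between passing a single position and a list of positions to iloc can be sketched as follows. The frame `toy` and its index labels are hypothetical stand-ins for the star table, chosen so that positions and labels differ.

```python
import pandas as pd

# Hypothetical stand-in for the star table, with a non-default index
toy = pd.DataFrame(
    {"Temperature": [3068, 3042, 2600], "Color": ["Red", "Red", "Red"]},
    index=[10, 11, 12],
)

row_as_series = toy.iloc[0]      # single position -> Series
row_as_frame = toy.iloc[[0]]     # list of positions -> DataFrame with one row
same_row_by_label = toy.loc[10]  # loc selects by index label instead of position

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
```

With the default index (0, 1, 2, ...), as in the star data, positions and labels coincide, so iloc and loc return the same rows.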

The describe method gives basic summary statistics for each numerical column:

summary = df.describe()
print(summary)
        Temperature     Luminosity         Size         A_M
count    240.000000     240.000000   240.000000  240.000000
mean   10497.462500  107188.361635   237.157781    4.382396
std     9552.425037  179432.244940   517.155763   10.532512
min     1939.000000       0.000080     0.008400  -11.920000
25%     3344.250000       0.000865     0.102750   -6.232500
50%     5776.000000       0.070500     0.762500    8.313000
75%    15055.500000  198050.000000    42.750000   13.697500
max    40000.000000  849420.000000  1948.500000   20.060000

We can also call methods of the individual columns to get summary information. The column objects (such as df['Temperature']) are called Series.

print("Mean Temperature is:",df['Temperature'].mean())
print("Max Temperature is:",df['Temperature'].max())
Mean Temperature is: 10497.4625
Max Temperature is: 40000

1.3. Visualize single variable data#

The Series objects (columns) have plot methods as well as the numerical summary methods.

df['Temperature'].plot.line(figsize=(10,7))
<Axes: >
../../_images/484b1cba160e392923dcad744fa8ccb3a850562edd238c06f89741094ac20d40.png

Pandas is interoperable with matplotlib and numpy, so for instance if we want to add labels to the figure above we simply add the following lines from matplotlib:

df['Temperature'].plot.line(figsize=(10,7));
plt.xlabel('Star index')
plt.ylabel('Star Temperature')
Text(0, 0.5, 'Star Temperature')
../../_images/ccb537275ba3427d9b1b96adba64f0d85ace93162feccd90d9415c9341597247.png

The above is equivalent to:

df['Temperature'].plot.line(xlabel = 'Star index', ylabel='Temperature', figsize=(10,7))
<Axes: xlabel='Star index', ylabel='Temperature'>
../../_images/1b84dad71086b9178b52a9a2eeb6fe65cee0da2e467b57fea220065416465c63.png

1.3.1. Exercise#

  • Check the Pandas Series.plot documentation and plot the temperature of the different stars as a histogram.

  • By looking at the histogram, what’s the most common temperature for stars?

## Your code here

1.4. Scatter plots for multiple variables#

A typical problem in any field is to understand how some properties relate to others, e.g. are two properties independent or correlated with each other? We can quickly explore the correlations in a data frame by plotting some properties against others in scatter plots:

df.plot.scatter('Temperature','Luminosity', figsize=(10,7))
<Axes: xlabel='Temperature', ylabel='Luminosity'>
../../_images/0ba1476acca1a1908cbc8df74aac431e38739b1dccf907e63a41b7f2515fdf6c.png

We notice that the values of the Luminosity go from very small to very big values:

print(df['Luminosity'].min())
print(df['Luminosity'].max())
8e-05
849420.0
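To put a number on that range, a quick calculation with the two printed values shows how many orders of magnitude the luminosity spans. This is only arithmetic on the output above; the variable names are chosen here for illustration.

```python
import numpy as np

lum_min = 8e-05      # minimum luminosity printed above
lum_max = 849420.0   # maximum luminosity printed above

# Number of powers of ten between the smallest and the largest value
orders_of_magnitude = np.log10(lum_max / lum_min)
print(f"The luminosity spans about {orders_of_magnitude:.1f} orders of magnitude")
```

On a linear axis, almost all of these points would be squeezed against the bottom of the plot.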

In situations like this, where we are plotting over a very wide range of values, it’s useful to change the scale to a logarithmic one:

ax = df.plot.scatter('Temperature','Luminosity', figsize=(10,7))
plt.yscale('log')
../../_images/5af8a4ac719be26d36ae0383957e7703b67fafb47932c017756832719eff13c8.png

1.4.1. Exercise#

  • Make scatter plots of the different star features.

  • Two of the feature columns in the data are monotonically correlated, find them. # Hint: you may need to use log scale to better see a linear correlation.

## Your code here

1.5. Sort the data#

We can sort the data using the sort_values method:

sorted_data = df.sort_values('Temperature',ascending=True)
sorted_data.head()
    Temperature  Luminosity   Size    A_M Color Spectral_Class         Type
4          1939    0.000138  0.103  20.06   Red              M    Red Dwarf
2          2600    0.000300  0.102  18.70   Red              M    Red Dwarf
7          2600    0.000400  0.096  17.40   Red              M    Red Dwarf
78         2621    0.000600  0.098  12.81   Red              M  Brown Dwarf
6          2637    0.000730  0.127  17.22   Red              M    Red Dwarf
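Two stars above share the same temperature (2600), so a second sort key can be useful. The sketch below, on a hypothetical four-row frame `toy`, sorts by temperature in descending order and breaks ties by ascending luminosity.

```python
import pandas as pd

# Hypothetical rows, including a tie in Temperature
toy = pd.DataFrame({
    "Temperature": [2600, 1939, 2600, 3068],
    "Luminosity": [0.0004, 0.000138, 0.0003, 0.0024],
})

# Hottest first; ties in temperature are broken by ascending luminosity
sorted_toy = toy.sort_values(
    ["Temperature", "Luminosity"], ascending=[False, True]
)
print(sorted_toy)
```

Note that sort_values returns a new data frame and keeps the original index labels, which is why the index column in the output above is no longer in order.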

1.6. Describe categorical data#

We can describe the categorical variable ‘Color’. In this case we get different results than when we used describe on a numerical value.

print(df['Color'].describe())
count     240
unique     17
top       Red
freq      112
Name: Color, dtype: object

We look at the unique values of ‘Color’:

print(df['Color'].unique())
['Red' 'Blue White' 'White' 'Yellowish White' 'Blue white'
 'Pale yellow orange' 'Blue' 'Blue-white' 'Whitish' 'yellow-white'
 'Orange' 'White-Yellow' 'white' 'yellowish' 'Yellowish' 'Orange-Red'
 'Blue-White']
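Several of these 17 labels are the same color written in different ways (‘Blue White’, ‘Blue white’, ‘Blue-white’, ‘Blue-White’). One possible way to merge such spellings, shown here on a few of the printed values only, is to lower-case the text and treat hyphens as spaces. Whether labels like ‘Whitish’ and ‘White’ should also be merged is a judgment call that this sketch does not make.

```python
import pandas as pd

# A few of the spellings printed above
colors = pd.Series(["Blue White", "Blue white", "Blue-white", "Blue-White",
                    "white", "White", "Yellowish", "yellowish"])

# Lower-case everything and treat hyphens as spaces
clean_colors = colors.str.lower().str.replace("-", " ", regex=False)

print(clean_colors.unique())
```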

1.6.1. Exercise#

  • Create a histogram to visualize how many stars of each color there are.

## Your code here

1.7. Filter and split data#

Sometimes we want to select sections of the data based on their values. We can easily do so with Pandas. Let’s find the set of stars whose temperature is higher than 10000 K. We first create a boolean array for the condition, that is, a vector which associates a true or false value with each star according to the filtering condition. In our case, it holds True if the star’s temperature is higher than 10000 K and False otherwise:

hot_stars_boolean_vector = df['Temperature'] > 10000
print(hot_stars_boolean_vector)
0      False
1      False
2      False
3      False
4      False
       ...  
235     True
236     True
237    False
238    False
239     True
Name: Temperature, Length: 240, dtype: bool

In Python, True behaves like the number 1 and False like the number 0 in arithmetic. That means that if we want to know how many stars are hotter than 10000 K, we can simply sum up our boolean vector:

number_of_hot_stars = np.sum(hot_stars_boolean_vector)
print(f'There are {number_of_hot_stars} hot stars in the dataset')
There are 90 hot stars in the dataset
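The same trick can be sketched on a handful of made-up temperatures. Because True counts as 1, the sum of a boolean vector is the number of matches and its mean is the fraction of matches. The Series `temps` is hypothetical.

```python
import pandas as pd

print(True + True)  # booleans behave like integers in arithmetic

temps = pd.Series([3068, 12000, 2600, 25000, 9000])  # hypothetical temperatures
is_hot = temps > 10000

n_hot = is_hot.sum()          # number of True values
fraction_hot = is_hot.mean()  # fraction of True values
print(n_hot, fraction_hot)
```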

It works the same for categorical data. Let’s find out the number of super giant stars:

super_giants_boolean_vector = df['Type'] == 'Super Giants' 
nb_super_giants = np.sum(super_giants_boolean_vector)
print(f'There are {nb_super_giants} super giants stars in the dataset')
There are 40 super giants stars in the dataset

If we are only interested in exploring the properties of super giant stars (because our dataset is too big or because white dwarfs are lame), we can select only the data of the super giant stars using the boolean vector we just created:

df_with_only_super_giants = df[super_giants_boolean_vector]
print(df_with_only_super_giants)
     Temperature  Luminosity  Size    A_M Color Spectral_Class          Type
40          3826    200000.0  19.0 -6.930   Red              M  Super Giants
41          3365    340000.0  23.0 -6.200   Red              M  Super Giants
42          3270    150000.0  88.0 -6.020   Red              M  Super Giants
43          3200    195000.0  17.0 -7.220   Red              M  Super Giants
44          3008    280000.0  25.0 -6.000   Red              M  Super Giants
45          3600    320000.0  29.0 -6.600   Red              M  Super Giants
46          3575    123000.0  45.0 -6.780   Red              M  Super Giants
47          3574    200000.0  89.0 -5.240   Red              M  Super Giants
48          3625    184000.0  84.0 -6.740   Red              M  Super Giants
49         33750    220000.0  26.0 -6.100  Blue              B  Super Giants
100        33300    240000.0  12.0 -6.500  Blue              B  Super Giants
101        40000    813000.0  14.0 -6.230  Blue              O  Super Giants
102        23000    127000.0  36.0 -5.760  Blue              O  Super Giants
103        17120    235000.0  83.0 -6.890  Blue              O  Super Giants
104        11096    112000.0  12.0 -5.910  Blue              O  Super Giants
105        14245    231000.0  42.0 -6.120  Blue              O  Super Giants
106        24630    363000.0  63.0 -5.830  Blue              O  Super Giants
107        12893    184000.0  36.0 -6.340  Blue              O  Super Giants
108        24345    142000.0  57.0 -6.240  Blue              O  Super Giants
109        33421    352000.0  67.0 -5.790  Blue              O  Super Giants
160        25390    223000.0  57.0 -5.920  Blue              O  Super Giants
161        11567    251000.0  36.0 -6.245  Blue              O  Super Giants
162        12675    452000.0  83.0 -5.620  Blue              O  Super Giants
163         5752    245000.0  97.0 -6.630  Blue              O  Super Giants
164         8927    239000.0  35.0 -7.340  Blue              O  Super Giants
165         7282    131000.0  24.0 -7.220  Blue              O  Super Giants
166        19923    152000.0  73.0 -5.690  Blue              O  Super Giants
167        26373    198000.0  39.0 -5.830  Blue              O  Super Giants
168        17383    342900.0  30.0 -6.090  Blue              O  Super Giants
169         9373    424520.0  24.0 -5.990  Blue              O  Super Giants
220        23678    244290.0  35.0 -6.270  Blue              O  Super Giants
221        12749    332520.0  76.0 -7.020  Blue              O  Super Giants
222         9383    342940.0  98.0 -6.980  Blue              O  Super Giants
223        23440    537430.0  81.0 -5.975  Blue              O  Super Giants
224        16787    246730.0  62.0 -6.350  Blue              O  Super Giants
225        18734    224780.0  46.0 -7.450  Blue              O  Super Giants
226         9892    593900.0  80.0 -7.262  Blue              O  Super Giants
227        10930    783930.0  25.0 -6.224  Blue              O  Super Giants
228        23095    347820.0  86.0 -5.905  Blue              O  Super Giants
229        21738    748890.0  92.0 -7.346  Blue              O  Super Giants

Wait, wait. What if we want to filter for two conditions, say, we want to keep only the very hot super giant stars? Lo and behold, we simply need to apply both conditions:

hot_super_giants = df[super_giants_boolean_vector & hot_stars_boolean_vector]
print(f"There are {hot_super_giants.shape[0]} super hot giants")
print(hot_super_giants)
There are 25 super hot giants
     Temperature  Luminosity  Size    A_M Color Spectral_Class          Type
49         33750    220000.0  26.0 -6.100  Blue              B  Super Giants
100        33300    240000.0  12.0 -6.500  Blue              B  Super Giants
101        40000    813000.0  14.0 -6.230  Blue              O  Super Giants
102        23000    127000.0  36.0 -5.760  Blue              O  Super Giants
103        17120    235000.0  83.0 -6.890  Blue              O  Super Giants
104        11096    112000.0  12.0 -5.910  Blue              O  Super Giants
105        14245    231000.0  42.0 -6.120  Blue              O  Super Giants
106        24630    363000.0  63.0 -5.830  Blue              O  Super Giants
107        12893    184000.0  36.0 -6.340  Blue              O  Super Giants
108        24345    142000.0  57.0 -6.240  Blue              O  Super Giants
109        33421    352000.0  67.0 -5.790  Blue              O  Super Giants
160        25390    223000.0  57.0 -5.920  Blue              O  Super Giants
161        11567    251000.0  36.0 -6.245  Blue              O  Super Giants
162        12675    452000.0  83.0 -5.620  Blue              O  Super Giants
166        19923    152000.0  73.0 -5.690  Blue              O  Super Giants
167        26373    198000.0  39.0 -5.830  Blue              O  Super Giants
168        17383    342900.0  30.0 -6.090  Blue              O  Super Giants
220        23678    244290.0  35.0 -6.270  Blue              O  Super Giants
221        12749    332520.0  76.0 -7.020  Blue              O  Super Giants
223        23440    537430.0  81.0 -5.975  Blue              O  Super Giants
224        16787    246730.0  62.0 -6.350  Blue              O  Super Giants
225        18734    224780.0  46.0 -7.450  Blue              O  Super Giants
227        10930    783930.0  25.0 -6.224  Blue              O  Super Giants
228        23095    347820.0  86.0 -5.905  Blue              O  Super Giants
229        21738    748890.0  92.0 -7.346  Blue              O  Super Giants

We can also apply the conditions directly on the data frame. Each condition needs its own parentheses, because & binds more tightly than == and >:

hot_super_giants = df[(df['Type'] == 'Super Giants') & (df['Temperature'] > 10000)]
print(f"\nThere are {hot_super_giants.shape[0]} super hot giants\n")
print(hot_super_giants)
There are 25 super hot giants

     Temperature  Luminosity  Size    A_M Color Spectral_Class          Type
49         33750    220000.0  26.0 -6.100  Blue              B  Super Giants
100        33300    240000.0  12.0 -6.500  Blue              B  Super Giants
101        40000    813000.0  14.0 -6.230  Blue              O  Super Giants
102        23000    127000.0  36.0 -5.760  Blue              O  Super Giants
103        17120    235000.0  83.0 -6.890  Blue              O  Super Giants
104        11096    112000.0  12.0 -5.910  Blue              O  Super Giants
105        14245    231000.0  42.0 -6.120  Blue              O  Super Giants
106        24630    363000.0  63.0 -5.830  Blue              O  Super Giants
107        12893    184000.0  36.0 -6.340  Blue              O  Super Giants
108        24345    142000.0  57.0 -6.240  Blue              O  Super Giants
109        33421    352000.0  67.0 -5.790  Blue              O  Super Giants
160        25390    223000.0  57.0 -5.920  Blue              O  Super Giants
161        11567    251000.0  36.0 -6.245  Blue              O  Super Giants
162        12675    452000.0  83.0 -5.620  Blue              O  Super Giants
166        19923    152000.0  73.0 -5.690  Blue              O  Super Giants
167        26373    198000.0  39.0 -5.830  Blue              O  Super Giants
168        17383    342900.0  30.0 -6.090  Blue              O  Super Giants
220        23678    244290.0  35.0 -6.270  Blue              O  Super Giants
221        12749    332520.0  76.0 -7.020  Blue              O  Super Giants
223        23440    537430.0  81.0 -5.975  Blue              O  Super Giants
224        16787    246730.0  62.0 -6.350  Blue              O  Super Giants
225        18734    224780.0  46.0 -7.450  Blue              O  Super Giants
227        10930    783930.0  25.0 -6.224  Blue              O  Super Giants
228        23095    347820.0  86.0 -5.905  Blue              O  Super Giants
229        21738    748890.0  92.0 -7.346  Blue              O  Super Giants
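Boolean vectors can be combined in other ways too. The sketch below uses a hypothetical four-row frame `toy` to show "and" (&), "or" (|) and "not" (~), plus the query method, which expresses the same kind of filter as a string.

```python
import pandas as pd

# Hypothetical rows in the style of the star table
toy = pd.DataFrame({
    "Temperature": [3826, 33750, 9373, 25000],
    "Type": ["Super Giants", "Super Giants", "Super Giants", "White Dwarf"],
})

is_giant = toy["Type"] == "Super Giants"
is_hot = toy["Temperature"] > 10000

hot_giants = toy[is_giant & is_hot]     # both conditions hold
hot_or_giant = toy[is_giant | is_hot]   # at least one condition holds
cool_giants = toy[is_giant & ~is_hot]   # giants that are not hot

# query() expresses the same filter as hot_giants in a string
hot_giants_q = toy.query("Type == 'Super Giants' and Temperature > 10000")
print(hot_giants_q)
```

Python's own `and`, `or` and `not` keywords do not work element-wise on Series, which is why the symbols are used outside of query strings.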

1.7.1. Exercise#

  • Find how many ‘White Dwarf’ stars have a surface temperature between 5000 K and 10000 K

  • Find the mean surface temperature of the White Dwarfs

  • How many times bigger are Super Giant stars compared to White Dwarfs?

  • What’s the variance in the size of Super Giant stars?

## Your code here

1.8. Creating new data frames and adding new columns to data frames#

We can create a new data frame from another one with only some of the original data frame columns. Let’s create a new data frame with only the temperature and type columns:

new_df = df[['Temperature','Type']]
print(new_df.head()) # It's always good practice to print the head of the data frames to make sure we're doing things right
   Temperature       Type
0         3068  Red Dwarf
1         3042  Red Dwarf
2         2600  Red Dwarf
3         2800  Red Dwarf
4         1939  Red Dwarf

We may also want to add new columns to an existing data frame, for instance, if we incorporate new data from a different file or we calculate new quantities based on the previous data. Here we are adding a new column whose values are the inverse of the luminosity:

df['Inverse Luminosity'] = 1 / df['Luminosity']
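Direct assignment, as above, modifies the data frame in place. An alternative is the assign method, which returns a new data frame and leaves the original untouched. The sketch below shows both on a hypothetical three-row frame; the column name `log10_luminosity` is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Luminosity": [0.001, 1.0, 1000.0]})  # hypothetical values

# Direct assignment modifies the frame in place
toy["Inverse Luminosity"] = 1 / toy["Luminosity"]

# assign() returns a new frame and leaves the original untouched
toy_with_log = toy.assign(log10_luminosity=np.log10(toy["Luminosity"]))

print(toy.columns.tolist())
print(toy_with_log.columns.tolist())
```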

1.8.1. Exercise#

  • Add a new feature vector to the new data frame with the volume of each star. # Hint: Notice the column ‘Size’ is the radius \(R\) of each star and that the volume of a sphere is \(\frac{4}{3} \pi R^{3}\)

  • (Bonus Exercise) Add a new feature vector to the new data frame with the mass of each star. # Hint: The mass \(m\) of an object is equal to the product of its volume \(V\) and its density \(\rho\), that is, \(m = \rho V \). Notice that different types of stars have different densities, so you’ll have to use filtering as we did above: \(\rho_{Dwarfs} = 10^{5} g/cc\), \(\rho_{Giants} = 10^{-8} g/cc\), \(\rho_{Main\ sequence} = 1 g/cc\). You are welcome to ignore the units; the goal is that you practice how to apply operations to a subset of a data frame.

## Your code here

1.9. Box plot of numerical data sorted by category#

Let’s visualise the range of temperatures of the different stars based on their type. To do so, we first select the features we want to visualise and then call a box plot:

# The boxplot argument 'by' will split the plot over the variable given.
df[['Temperature','Type']].boxplot(by='Type', figsize=(10,7))
<Axes: title={'center': 'Temperature'}, xlabel='[Type]'>
../../_images/9c787d8d3eb2904f7b0113cde33d0505e975d28babc7aae65d64ad1cd1c04ef8.png
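The numbers behind such a box plot can also be obtained as a table. A minimal sketch, assuming a hypothetical five-row frame `toy`, groups the rows by type with groupby (a method not used elsewhere in this notebook) and computes a few summary statistics per group.

```python
import pandas as pd

# Hypothetical rows in the style of the star table
toy = pd.DataFrame({
    "Temperature": [3000, 3200, 25000, 15000, 9000],
    "Type": ["Red Dwarf", "Red Dwarf", "White Dwarf", "White Dwarf", "White Dwarf"],
})

# One row per star type, one column per statistic
per_type = toy.groupby("Type")["Temperature"].agg(["min", "median", "max"])
print(per_type)
```

The median corresponds to the line inside each box, while the minimum and maximum bound the whiskers or outlier markers.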

1.9.1. Exercise#

  • Make a figure similar to the one above, but displaying the range of volumes of the different star types

## Your code here

1.10. Multi-plot figures#

Now that we know how to filter data, let’s make some figures. We construct a figure with 4 subplots:

fig, ax = plt.subplots(2, 2, figsize=(14, 10))

## Plot the data. ax[i,j] references the Axes in row i, column j
df.plot.scatter('Temperature','Luminosity',color='xkcd:rust',alpha=0.7,ax=ax[0,0])
df.plot.scatter('Temperature','Size',color='xkcd:blurple',alpha=0.7,ax=ax[0,1])
df.plot.scatter('Temperature','A_M',color='xkcd:slate blue',alpha=0.7,ax=ax[1,0])
df.plot.scatter('Luminosity','Size',color='xkcd:pumpkin',alpha=0.7,ax=ax[1,1])
<Axes: xlabel='Luminosity', ylabel='Size'>
../../_images/f6a6257d87990b5ac3d7c1a4a72affb830b610a63d2248b6fd6cca0d664f22a7.png

We can see in the plot of \(A_M\) versus Temperature that there is a cluster of points (\(A_M>9\), Temperature \(>5000\)) where the variables appear to be strongly correlated. We might want to isolate and study that particular subset of the data by extracting it to a different DataFrame.

Let’s isolate it into the variable df_TAM and plot it in a different color:

df_AM = df[df['A_M'] > 9]
df_TAM = df_AM[df_AM['Temperature'] > 5000]

## Plot the subset with the original
ax = df.plot.scatter('Temperature','A_M',color='xkcd:slate blue',alpha=0.8)
df_TAM.plot.scatter('Temperature','A_M',color='xkcd:red',ax=ax,alpha=0.7)
<Axes: xlabel='Temperature', ylabel='A_M'>
../../_images/9cab266ea0852734a11705aa66bf6aa007990437612023d73ec44d55b9a0317a.png

Let’s print the statistics of this subset of data

print(df_TAM.describe())
        Temperature  Luminosity       Size        A_M  Inverse Luminosity
count     40.000000   40.000000  40.000000  40.000000           40.000000
mean   13931.450000    0.002434   0.010728  12.582500         2836.282072
std     4957.655189    0.008912   0.001725   1.278386         3270.623635
min     7100.000000    0.000080   0.008400  10.180000           17.857143
25%     9488.750000    0.000287   0.009305  11.595000          814.754098
50%    13380.000000    0.000760   0.010200  12.340000         1316.701317
75%    17380.000000    0.001227   0.012025  13.830000         3479.064039
max    25000.000000    0.056000   0.015000  14.870000        12500.000000

1.11. Computing correlations#

Let’s explore the linear correlations of the data. The Pearson correlation coefficient \(\rho\) is a measure of how linearly correlated two variables are: it’s 1 for a perfect positive linear correlation, -1 for a perfect negative one, and zero if there is no linear correlation.

A correlation coefficient tells us how much one variable is related to another or, in other words, how much one variable informs us about the other one. For instance, your height in meters should be perfectly correlated with your height measured in feet (\(\rho=1\)), but your height should not be correlated with how much chocolate you eat when you’re feeling sad (\(\rho=0\)).

A correlation is said to be linear if you can convert from one variable to the other using linear transformations only, i.e. adding a constant and multiplying by a constant, but not applying powers, square roots, etc.
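A small numerical sketch of this idea: a variable transformed linearly keeps a Pearson coefficient of 1 with the original, whereas a power does not. The data here are synthetic and the names `y_linear` and `y_cubic` are chosen for illustration.

```python
import numpy as np

x = np.linspace(0, 10, 50)  # synthetic data

y_linear = 2 * x + 3        # linear transformation of x
y_cubic = x ** 3            # non-linear transformation of x

# np.corrcoef returns the correlation matrix; entry [0, 1] is the coefficient
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_cubic = np.corrcoef(x, y_cubic)[0, 1]

print(f"linear: {r_linear:.4f}, cubic: {r_cubic:.4f}")
```

The cubic relation is perfectly deterministic, yet its Pearson coefficient is below 1 because the relation is not a straight line.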

Let’s use SciPy to compute the correlations of our data. One of the nice aspects of the Python ecosystem is that data is often interoperable between libraries: here we have loaded our star data with Pandas and will use SciPy to compute the correlations.

Let’s start by doing a sanity check: a variable should be VERY correlated with itself, right? Let’s plot the temperature against the temperature using a scatter plot:

df.plot.scatter('Temperature','Temperature', figsize=(14,10));
../../_images/6505730fec9759c5ef4325e57b807f554ce007481d156a8cec3ec43d69d1d7eb.png

What value of the Pearson correlation coefficient do you expect to get? If it’s not obvious to you, think about it before running the next code cell.

from scipy import stats

r, p = stats.pearsonr(df['Temperature'], df['Temperature'])
print(f"The correlation coefficient is {r}")
The correlation coefficient is 0.9999999999999998

A variable always has a correlation coefficient of one with itself; the printed value differs from 1 only because of floating-point rounding. Let’s now explore the rest of the data.

1.11.1. Exercise#

  • Find the two pairs of variables with the highest absolute correlation # Hint: You can use SciPy’s stats.pearsonr function; alternatively, Pandas data frames have a corr() method that outputs the Pearson correlation between the different variables (in recent Pandas versions you may need corr(numeric_only=True) so that the text columns are skipped). # Hint 2: If you want to visualise the result, you can use dataframe_name.corr().style.background_gradient(cmap='coolwarm')

  • Once you find the two variables, make their scatter plot again, but this time apply the logarithmic function np.log(df['whichever variable']) before computing the Pearson correlation coefficient again. How does the Pearson correlation coefficient change after applying the logarithmic transformation?

1.12. Non-linear correlations#

Look at the following figure; the number above each dataset is its Pearson coefficient:

Notice how the data points at the bottom are clearly correlated, yet Pearson tells us the coefficient is zero. That’s because the correlations are non-linear.

There are many types of correlation coefficients we can compute. Some of them, like Spearman’s, can capture non-linear correlations as long as they are monotonic. We won’t explore them further here, but be aware that they exist if you ever suspect your data may be hiding non-linear correlations.
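As a rough illustration on synthetic data, the sketch below compares the two coefficients for an exponential relation (monotonic) and a parabola (not monotonic). Spearman's coefficient is computed here as the Pearson coefficient of the ranks, which is its definition; the helper name `spearman` is introduced only for this example.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(-10, 11), dtype=float)  # synthetic data, symmetric around 0

y_exp = np.exp(x)  # monotonic but non-linear
y_sq = x ** 2      # symmetric parabola, not monotonic

def spearman(a, b):
    # Spearman's coefficient is the Pearson coefficient of the ranks
    return a.rank().corr(b.rank())

print("exp:      Pearson", x.corr(y_exp), "Spearman", spearman(x, y_exp))
print("parabola: Pearson", x.corr(y_sq), "Spearman", spearman(x, y_sq))
```

Spearman recovers the exponential relation perfectly, but neither coefficient detects the parabola, so a value near zero does not prove that two variables are unrelated.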

1.13. The relation between correlation coefficients and predictive models#

Machine learning models (like any other model) are typically used to connect one variable to another. What happens if these two variables are not related at all? Well, then it’s simply not possible to build a model predicting one variable as a function of the other. If two variables X and Y are independent, i.e. they share no correlation of any kind, linear or non-linear, then knowing X does not give us any information about Y. The converse also holds: if two quantities are correlated, we should (in principle) be able to build a model linking them.

Now, in most natural phenomena, quantities are high-dimensional and non-linearly correlated, so we can’t simply predict from a single correlation coefficient whether we will be able to build a model. In these cases, training and evaluating the model is the only way of looking for correlations.

1.14. Linear Regression#

Let’s finish this notebook by doing a linear regression on the data.

A linear regression consists of a linear model that relates one variable to another, for instance the temperature of a star to its luminosity. Linear models have the advantage of being easily interpreted: you can look at the model and figure out what’s going on. On the other hand, they cannot model non-linear relations, and God had the poor taste of making natural phenomena very non-linear. In the Machine learning notebooks, we’ll learn how to train models that can deal with non-linear dependencies.

Since the data in the high absolute magnitude \(A_M\), high-Temperature subset seem to be strongly correlated, we might fit a linear model. To do this we will import the \(\texttt{linregress}\) function from the \(\texttt{stats}\) module in SciPy.

from scipy.stats import linregress

linear_model = linregress(df_TAM['Temperature'],df_TAM['A_M'])

Let’s plot the regression line together with the data.

m = linear_model.slope
b = linear_model.intercept

x = np.linspace(5000,25000,5) # Range of temperatures
y = m*x + b
ax = df_TAM.plot.scatter('Temperature','A_M',color='xkcd:slate blue',s=40,edgecolor='black',alpha=0.8)
ax.plot(x,y,color='green',ls='dashed')
[<matplotlib.lines.Line2D at 0x13d13a590>]
../../_images/51a356922f704ea10934b297815d2123953d91d263899a7b7500ac2570c8d3e1.png
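A straight-line fit of this kind is an ordinary least-squares fit, so NumPy's polyfit with degree 1 offers a quick cross-check of the slope and intercept. The sketch below uses synthetic data with a made-up slope and intercept rather than the star data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a noisy straight line with hypothetical slope and intercept
x = np.linspace(5000, 25000, 40)
y = -2e-4 * x + 15 + rng.normal(0, 0.5, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.2e}, intercept = {intercept:.2f}")
```

The fitted values land close to the ones used to generate the data, with the difference coming from the added noise.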

The model object produced by \(\texttt{linregress}\) also contains the correlation coefficient, the p-value, and the standard error of the slope.

print("Correlation coefficient:",linear_model.rvalue)
print("pvalue for null hypothesis of slope = 0:",linear_model.pvalue)
print("Standard error of the estimated gradient:",linear_model.stderr)
Correlation coefficient: -0.8201933172733418
pvalue for null hypothesis of slope = 0: 9.411965493687398e-11
Standard error of the estimated gradient: 2.3930706947460813e-05
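One common way to read the correlation coefficient of a straight-line fit is to square it: the result, often written \(R^2\), is the fraction of the variance in \(A_M\) that the line accounts for. A short calculation using the value printed above:

```python
r = -0.8201933172733418  # correlation coefficient printed above
r_squared = r ** 2       # fraction of the variance in A_M explained by the line
print(f"R^2 = {r_squared:.3f}")
```

So roughly two thirds of the variance in absolute magnitude within this subset is captured by the linear dependence on temperature.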

1.15. The Hertzsprung-Russell Diagram#

The Hertzsprung-Russell Diagram is a scatter plot of stars showing the relationship between the stars’ absolute magnitudes or luminosities and their temperatures.

Let’s see if we can obtain something similar from our data:

1.15.1. Exercise#

  • Make a scatter plot from our star data. Plot each star type ‘Super Giants’, ‘Main sequence’ and ‘White Dwarf’ in different colours.

  • Can you observe similar star clusters? # Hint: You might need to use logarithmic scales for the axes and reverse the direction of the temperature axis.

## Your code here