3. Reading and plotting#
3.1. Reading different data formats#
Data comes in many different forms. A few common data formats are:
Tabular data, often with fields separated by tabs or a variable number of spaces
Comma Separated Values (CSV), or semicolon separated values
Excel or LibreOffice spreadsheets
Hierarchical Data Format, like HDF5, which is also used as MATLAB's native format
Apache Parquet format
Common Data Format (CDF) or NetCDF
Databases using Structured Query Language (SQL)
And many, many others; see Wikipedia for more.
3.2. What data format to use?#
A good data format allows storing metadata together with the data, can store data structures like matrices, and can natively handle various data types, such as timestamps, numbers and strings. Some data formats also support transparent compression of the data. With bigger data sets, the possibility to store the data in binary form is also a benefit.
The common text and spreadsheet based 2-dimensional data formats are suitable for small data sets, if you do not need complex data structures:
CSV is probably the most common data format, but it is not well standardized. Therefore reading CSV often requires plenty of parsing work before it succeeds. It is also verbose and requires a lot of space. Almost the same applies to tabular data.
Spreadsheet formats may be a little better, but they also support only 2-dimensional data sets, and storing metadata is not supported.
For lighter data sets, Feather is a very good option. It is fast and interoperable, being based on the Apache Arrow library, which makes reading and writing the data very fast.
Parquet, designed for Hadoop, is often used in the Big Data domain. Use it if you need to store huge data sets.
Hierarchical Data Format (HDF) supports transparent compression, metadata and various data structures, and an HDF file can be explored without external documentation of its content. It is commonly used for many purposes; for example, MATLAB uses it. NetCDF is a simplified subset of HDF. Use HDF instead of NetCDF if possible. Note: using HDF5 in Pandas requires installing the pytables module in Python.
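As a small taste of the metadata support, the following sketch (with hypothetical file, key and attribute names) stores a free-form note together with a dataframe in an HDF5 file, assuming the PyTables package (tables on PyPI) is installed:
import pandas as pd
# Store a small example dataframe in HDF5 together with a descriptive note
df = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
with pd.HDFStore('example.h5') as store:
    store.put('measurements', df)
    # Attach arbitrary metadata to the stored node and read it back
    store.get_storer('measurements').attrs.note = 'example metadata stored with the data'
    print(store.get_storer('measurements').attrs.note)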
3.3. Asphalt heat data#
The data is clearly a CSV file with three columns, where the field separator is a semicolon. The first column is a timestamp in ISO format. The other columns are numerical and use decimal points. The first row is a column header.
Timestamp; Irradiance; Heatflux
2014-10-01 03:00:20; 0.2; -83.1
2014-10-01 03:15:20; 0.2; -82.8
2014-10-01 03:30:20; 0.1; -82.5
2014-10-01 03:45:20; 0.1; -82.3
2014-10-01 04:00:20; 0.1;
2014-10-01 04:15:20; 0.1; -82.2
......
3.4. Pandas#
Parsing simple CSV data files is easy with many methods, but it is easiest if you use the Pandas data analysis package. And because Pandas provides so many other features as well, let's take a closer look at it. Read 10 minutes to pandas at this stage.
It is really convenient to let Pandas parse the timestamps, by defining the columns which are to be parsed with the parse_dates argument. You can just list the columns which contain timestamp values.
This data is clearly time series data, so it is best indexed with a timestamp. In this way you can very easily select data based on the timestamp, and when you plot the data, you automatically get a time axis on the x-axis. This can be achieved by defining the column which is used as the index with the index_col parameter.
It is really inconvenient that, by default, Pandas treats whitespace at the beginning of fields as significant. For example, in the previous dataset the header included the column names in the following format: Timestamp; Irradiance; Heatflux. There is a whitespace between the semicolon and the next field name, ; Irradiance for instance. By default Pandas parses ' Irradiance' as the name of this field, which has annoying consequences. Therefore it is good practice to use the skipinitialspace parameter to prevent this.
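To see the effect, here is a minimal sketch comparing the parsed column names with and without skipinitialspace, using an inline string in place of a file:
import io
import pandas as pd
csv = 'Timestamp; Irradiance; Heatflux\n2014-10-01 03:00:20; 0.2; -83.1\n'
print(pd.read_csv(io.StringIO(csv), sep=';').columns.tolist())
# ['Timestamp', ' Irradiance', ' Heatflux']  <- leading spaces kept in the names
print(pd.read_csv(io.StringIO(csv), sep=';', skipinitialspace=True).columns.tolist())
# ['Timestamp', 'Irradiance', 'Heatflux']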
Now we are well prepared to read the data using the pd.read_csv function, which returns a data structure called a Pandas dataframe. It is a little bit similar to the data frames in R. Basically it is similar to a NumPy array, but it has many more advanced features.
After reading the pandas dataframe, we can print the shape (number of rows and columns in the dataframe) and the head (the first rows) of the dataframe.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Parse column 0 as dates and use it as the row index; skip the spaces after the separators
D=pd.read_csv('data/asphaltHeat.csv', sep=';', parse_dates=[0], index_col=0, skipinitialspace=True)
# For comparison: a plain read without date parsing or indexing
#D=pd.read_csv('data/asphaltHeat.csv', sep=';', skipinitialspace=True)
print(D.shape)
D.head(6)
(2968, 2)
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 03:00:20 | 0.2 | -83.1 |
2014-10-01 03:15:20 | 0.2 | -82.8 |
2014-10-01 03:30:20 | 0.1 | -82.5 |
2014-10-01 03:45:20 | 0.1 | -82.3 |
2014-10-01 04:00:20 | 0.1 | NaN |
2014-10-01 04:15:20 | 0.1 | -82.2 |
3.4.1. Try feather#
The Feather format supports many kinds of data structures and no parsing is needed: the data just comes back as the same data structure, without skipping spaces or parsing the dates in any specific way. The storage space also shrinks to about one third when using Feather instead of CSV.
Pandas supports many data formats, so using them is very simple.
D.to_feather('data/asphaltHeat.feather')
D.to_parquet('data/asphaltHeat.parquet')
D.to_hdf('data/asphaltHeat.hdf', key='asphalt')
featherD=pd.read_feather('data/asphaltHeat.feather')
featherD.head(6)
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 03:00:20 | 0.2 | -83.1 |
2014-10-01 03:15:20 | 0.2 | -82.8 |
2014-10-01 03:30:20 | 0.1 | -82.5 |
2014-10-01 03:45:20 | 0.1 | -82.3 |
2014-10-01 04:00:20 | 0.1 | NaN |
2014-10-01 04:15:20 | 0.1 | -82.2 |
3.5. Missing data#
In the original CSV file, the Heatflux value corresponding to timestamp 2014-10-01 04:00:20 is missing
It is represented by NaN in the parsed dataframe, which is the floating point representation of the Not a Number value
Generally it is said that this value is not available (NA)
Missing data can cause bias in statistical operations
Arithmetic operations with Not a Number values are undefined.
a=3.141
b=np.nan
print("%f + %f = %f\n" % (a,b, a+b))
3.141000 + nan = nan
But many statistical functions in Pandas have a built-in strategy for handling missing data. The result will be calculated, but there is still a danger of bias.
# Default behaviour for mean() function is to skip NaN values
print(D.head(6).mean())
print("\n")
# It can be explicitly set to include them; use "help(D.mean)" to get more help
print(D.head(6).mean(skipna=False))
# Describe provides a lot of statistics of the data; it ignores the NA values
# But look at the count row: there seem to be a lot fewer values in Heatflux
# Those are probably missing values
D.describe()
Irradiance 0.133333
Heatflux -82.580000
dtype: float64
Irradiance 0.133333
Heatflux NaN
dtype: float64
Irradiance | Heatflux | |
---|---|---|
count | 2968.000000 | 2952.00000 |
mean | 33.743093 | -2.23269 |
std | 69.410991 | 10.00759 |
min | -0.500000 | -84.90000 |
25% | 0.100000 | 0.00000 |
50% | 0.300000 | 0.00000 |
75% | 34.525000 | 0.00000 |
max | 513.200000 | 36.20000 |
D.mean()
Irradiance 33.743093
Heatflux -2.232690
dtype: float64
3.5.1. Detecting NAs#
NA values can be detected using the isna() function.
D.head(6).isna()
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 03:00:20 | False | False |
2014-10-01 03:15:20 | False | False |
2014-10-01 03:30:20 | False | False |
2014-10-01 03:45:20 | False | False |
2014-10-01 04:00:20 | False | True |
2014-10-01 04:15:20 | False | False |
D.Heatflux.isna()
Timestamp
2014-10-01 03:00:20 False
2014-10-01 03:15:20 False
2014-10-01 03:30:20 False
2014-10-01 03:45:20 False
2014-10-01 04:00:20 True
...
2014-10-31 22:45:20 False
2014-10-31 23:00:20 False
2014-10-31 23:15:20 False
2014-10-31 23:30:20 False
2014-10-31 23:45:20 False
Name: Heatflux, Length: 2968, dtype: bool
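Because True counts as one in a sum, the number of missing values per column can be checked with a one-liner, which is a good first step with any new dataset:
# Count the NA values in each column
D.isna().sum()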
3.5.2. Selecting based on NA status#
The boolean vector of NA statuses can be used for selecting only the NA rows, or for selecting all but the NA rows.
Notice that ~ acts as the element-wise boolean NOT operator on Pandas boolean vectors.
# Select all but NA rows
b=D.Heatflux.isna()
D[~b]
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 03:00:20 | 0.2 | -83.1 |
2014-10-01 03:15:20 | 0.2 | -82.8 |
2014-10-01 03:30:20 | 0.1 | -82.5 |
2014-10-01 03:45:20 | 0.1 | -82.3 |
2014-10-01 04:15:20 | 0.1 | -82.2 |
... | ... | ... |
2014-10-31 22:45:20 | 0.2 | 0.0 |
2014-10-31 23:00:20 | 0.2 | 0.0 |
2014-10-31 23:15:20 | 0.2 | 0.0 |
2014-10-31 23:30:20 | 0.3 | 0.0 |
2014-10-31 23:45:20 | 0.3 | 0.0 |
2952 rows × 2 columns
~b
Timestamp
2014-10-01 03:00:20 True
2014-10-01 03:15:20 True
2014-10-01 03:30:20 True
2014-10-01 03:45:20 True
2014-10-01 04:00:20 False
...
2014-10-31 22:45:20 True
2014-10-31 23:00:20 True
2014-10-31 23:15:20 True
2014-10-31 23:30:20 True
2014-10-31 23:45:20 True
Name: Heatflux, Length: 2968, dtype: bool
# All rows where the Heatflux value is not known can be easily selected
# by indexing the dataframe directly with the boolean vector given by isna()
D[D.Heatflux.isna()]
# This is equivalent to using the previous b vector:
#D[b]
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 04:00:20 | 0.1 | NaN |
2014-10-01 08:30:20 | 86.9 | NaN |
2014-10-01 16:15:20 | 186.2 | NaN |
2014-10-02 02:30:20 | 0.1 | NaN |
2014-10-02 02:45:20 | 0.1 | NaN |
2014-10-02 03:00:20 | 0.1 | NaN |
2014-10-02 03:15:20 | 0.1 | NaN |
2014-10-02 06:00:20 | -0.0 | NaN |
2014-10-02 06:30:20 | 0.4 | NaN |
2014-10-02 06:45:20 | 2.7 | NaN |
2014-10-02 07:00:20 | 6.0 | NaN |
2014-10-02 07:15:20 | 13.2 | NaN |
2014-10-02 07:30:20 | 12.9 | NaN |
2014-10-02 07:45:20 | 19.4 | NaN |
2014-10-02 08:00:20 | 32.4 | NaN |
2014-10-02 08:15:20 | 118.4 | NaN |
3.5.3. Examples of data frame indexing#
Dataframes allow very convenient indexing when the index is a datetime object. One can select rows where a component of the timestamp matches a specified value. For example, it is possible to select rows where the hour field of the timestamp is a certain number, thus selecting from the data all values which represent the middle of the day.
D[D.index.hour==12]
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 12:00:20 | 438.5 | 36.2 |
2014-10-01 12:15:20 | 430.4 | 33.6 |
2014-10-01 12:30:20 | 119.2 | 22.6 |
2014-10-01 12:45:20 | 288.5 | 12.6 |
2014-10-02 12:00:20 | 89.2 | -0.0 |
... | ... | ... |
2014-10-30 12:45:20 | 232.0 | 0.0 |
2014-10-31 12:00:20 | 60.2 | 0.0 |
2014-10-31 12:15:20 | 77.5 | 0.0 |
2014-10-31 12:30:20 | 106.0 | 0.0 |
2014-10-31 12:45:20 | 124.0 | 0.0 |
124 rows × 2 columns
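The same idea extends to the other components of the timestamp, and several conditions can be combined with the element-wise & operator. A small sketch, not part of the original analysis:
# Noon observations, but only those from the first half of each hour
D[(D.index.hour==12) & (D.index.minute<30)]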
3.5.4. Other methods for handling missing data#
The default handling of missing data is to skip it. The dataframe can also be preprocessed to handle the missing data in the following ways:
Drop all rows which have missing data. This can cause bias and the loss of valuable data in other columns.
Drop all columns which have missing data. This causes loss of data.
Fill the missing data with other values (imputation), such as:
Constants, like 0, np.inf, etc.
The mean value (see the sketch after this list)
The previous or next available value
An interpolated value
A value from a similar record (hot deck)
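Constants, previous values and interpolation are demonstrated in section 3.5.6 below. Mean-value imputation is sketched here, relying on the fact that fillna() also accepts the column means as fill values:
# Replace the NAs in each column with that column's mean value
D.fillna(D.mean()).head(6)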
3.5.5. Drop missing values#
Observe how the shape of the data matrix is reduced.
print(D.shape)
print(D.dropna().shape)
(2968, 2)
(2952, 2)
D.Irradiance
Timestamp
2014-10-01 03:00:20 0.2
2014-10-01 03:15:20 0.2
2014-10-01 03:30:20 0.1
2014-10-01 03:45:20 0.1
2014-10-01 04:00:20 0.1
...
2014-10-31 22:45:20 0.2
2014-10-31 23:00:20 0.2
2014-10-31 23:15:20 0.2
2014-10-31 23:30:20 0.3
2014-10-31 23:45:20 0.3
Name: Irradiance, Length: 2968, dtype: float64
D.dropna() drops all rows which have NA values in any column. It has clearly dropped 16 rows, so even the Irradiance field, which contains no NAs itself, loses 16 valid values.
3.5.6. Dropping and Imputing data#
NA values can be dropped or filled, but both methods may affect the statistics.
Observe that each strategy yields a slightly different mean. What is the best strategy?
print(D.Irradiance.mean())
print(D.dropna().Irradiance.mean())
33.74309299191375
33.763719512195124
# Drop the missing data, and then calculate the mean Heatflux
print(D.Heatflux.mean()) # Default behavior: skip the missing values
print(D.dropna().Heatflux.mean()) # Drop the missing rows explicitly
# Impute the missing values, and then calculate the mean Heatflux
print(D.fillna(0).Heatflux.mean()) # Fill the missing values with zeros
print(D.ffill().Heatflux.mean()) # Copy the previous value in place of each missing value
print(D.interpolate().Heatflux.mean()) # Interpolate the missing values
-2.2326897018970193
-2.2326897018970193
-2.2206536388140163
-2.273247978436658
-2.272254043126685
D.interpolate().head(6)
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 03:00:20 | 0.2 | -83.10 |
2014-10-01 03:15:20 | 0.2 | -82.80 |
2014-10-01 03:30:20 | 0.1 | -82.50 |
2014-10-01 03:45:20 | 0.1 | -82.30 |
2014-10-01 04:00:20 | 0.1 | -82.25 |
2014-10-01 04:15:20 | 0.1 | -82.20 |
# Interpolate one day of data and plot the mean of each column as a bar
D.loc['2014-10-01'].interpolate().mean().plot.bar()
<Axes: >
3.6. Summary#
There are often missing values in data. They can be handled with many methods, but all of them have consequences.
Always study how many missing values there are in the data
Be aware of the default missing data handling method, and consider whether it is the best one in your case
Pandas and other software support many missing data handling methods if the default is not sufficient
Make sure that you know how your missing data handling method affects the statistics
3.7. Slicing pandas data frame#
Pandas dataframes support really versatile methods for indexing in both the column-wise and row-wise directions.
Dataframes can be indexed by column name, using either dot notation or the column name in square brackets. Using column names is useful if your data structure later changes, and it also makes the code easier to understand.
print(D.columns)
# These two lines produce identical results
D.Irradiance.head()
D['Irradiance'].head()
Index(['Irradiance', 'Heatflux'], dtype='object')
Timestamp
2014-10-01 03:00:20 0.2
2014-10-01 03:15:20 0.2
2014-10-01 03:30:20 0.1
2014-10-01 03:45:20 0.1
2014-10-01 04:00:20 0.1
Name: Irradiance, dtype: float64
Dataframes can also be indexed using the row index. If it is a time field, many different time formats are handled.
Notice that when only the day is specified in a datetime index, every timestamp from the specified day will be selected.
D.loc['2014-10-15', 'Irradiance']
Timestamp
2014-10-15 00:00:20 0.3
2014-10-15 00:15:20 0.3
2014-10-15 00:30:20 0.3
2014-10-15 00:45:20 0.3
2014-10-15 01:00:20 0.3
...
2014-10-15 22:45:20 0.2
2014-10-15 23:00:20 0.2
2014-10-15 23:15:20 0.3
2014-10-15 23:30:20 0.3
2014-10-15 23:45:20 0.3
Name: Irradiance, Length: 96, dtype: float64
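Label-based indexing also accepts ranges, so several days can be selected with a single slice. A sketch following the same pattern (note that both endpoints are inclusive):
# All Irradiance values from the 15th to the 17th of October
D.loc['2014-10-15':'2014-10-17', 'Irradiance']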
Dataframes also support MATLAB- or R-like location-based indexing:
D.iloc[0:5, :2]
Irradiance | Heatflux | |
---|---|---|
Timestamp | ||
2014-10-01 03:00:20 | 0.2 | -83.1 |
2014-10-01 03:15:20 | 0.2 | -82.8 |
2014-10-01 03:30:20 | 0.1 | -82.5 |
2014-10-01 03:45:20 | 0.1 | -82.3 |
2014-10-01 04:00:20 | 0.1 | NaN |
Read more from the tutorialspoint article Python Pandas - Indexing and Selecting Data.
3.7.1. Plotting#
The data in the dataframe can be plotted almost automatically, and decorations can be added easily. Notice that the x and y labels support LaTeX format for adding mathematical notation. The time representation is automatically adjusted when the dataframe index is a datetime object.
# Select one day, and plot the data
(D.loc['2014-10-01']).plot()
#(D.loc['2014-10']).plot()
<Axes: xlabel='Timestamp'>
# Let's add some decorations to the plot
ax=(D.loc['2014-10-01']).plot()
ax.grid()
ax.set_ylabel('Power [W/$m^2$]')
ax.set_xlabel('Time')
Text(0.5, 0, 'Time')
D.columns
Index(['Irradiance', 'Heatflux'], dtype='object')
Notice that the missing values cause some breaks in the lines. They can be fixed with imputation.
In this case, interpolation is clearly the best method for handling the missing data; otherwise an integral of the total power would be more biased.
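To see why, here is a rough sketch (assuming the 15-minute sampling interval of this data) that estimates the day's total heat energy per square metre by summing the power samples; dropped samples contribute nothing, pulling the integral towards zero:
day=D.loc['2014-10-01']
dt=15*60 # sampling interval in seconds
print(day.Heatflux.dropna().sum()*dt) # missing samples left out entirely
print(day.Heatflux.interpolate().sum()*dt) # missing samples estimated by interpolation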
The figures can also be saved as images in various formats.
# Interpolate the missing values before plotting
ax=(D.loc['2014-10-01']).interpolate().plot()
#ax=(D.loc['2014-10-01']).dropna().plot()
ax.grid()
ax.set_ylabel('Power [W/$m^2$]')
plt.savefig('figure1.svg')
plt.savefig('figure1.png')
3.8. More challenging dataset#
(Very difficult actually)
Water quality data (CSV)
name of the station;ID-number of the station;ETRS-coordinates east;ETRS-coordinates north;max depth of the station;date;sampling depth;Faecal enterococci (kpl/100ml);Oxygen saturation (kyll.%);Dissolved oxygen (mg/l);Suspended solids, coarse (mg/l);Chlorophyll a (µg/l);Total phosphorous, unfiltered (µg/l);Total nitrogen, unfiltered (µg/l);Coliform bacteria thermotolerant (kpl/100ml);Temperature (°C);Nitrate as nitrogen, unfiltered (µg/l);Nitrate as nitrogen, unfiltered (µg/l);Nitrite nitrate as nitrogen, unfiltered (µg/l);Secchi depth (m);pH ;Salinity (‰);Turbidity (FNU);Conductivity (mS/m)
Et kaup selkä 1;5520;227911;7005357;2.4;19.3.1974 0:00;;;;;;;;;;;;;;0.5;;;;
Et kaup selkä 1;5520;227911;7005357;2.4;19.3.1974 0:00;1,0;0;56;7.9;;;14;4100;0;0.3;;;;;4.6;;8.1;231
Et kaup selkä 1;5520;227911;7005357;2.4;12.6.1974 0:00;;;;;;;;;;;;;;1.1;;;;
Et kaup selkä 1;5520;227911;7005357;2.4;12.6.1974 0:00;1,0;0;104;10.3;;;20;410;0;14.5;;;;;8.1;;;627
Et kaup selkä 1;5520;227911;7005357;2.4;21.10.1974 0:00;;;;;;;;;;;;;;1;;;;
Et kaup selkä 1;5520;227911;7005357;2.4;21.10.1974 0:00;1,0;3;93;12.2;;;20;1200;79;2.9;;;;;7.2;;;594
Et kaup selkä 1;5520;227911;7005357;2.4;4.6.1975 0:00;;;;;;;;;;;;;;1;;;;
Et kaup selkä 1;5520;227911;7005357;2.4;4.6.1975 0:00;1,0;2;102;11;;;40;560;17;10.5;;;;;6.7;;7.8;390
The dataset is clearly in CSV format with semicolon separated values. Some columns are numerical and others are strings. The column names are rather long strings. The sixth column (column number 5, if indexed from zero) is the timestamp. Let's read it:
WD=pd.read_csv('data/waterquality.csv', sep=';', parse_dates=[5], dayfirst=True, index_col=5, skipinitialspace=True, encoding='latin1')
print(WD.shape)
# Show the rows from July 1987, and all columns starting from column 4 (the fifth column)
WD.loc['1987-07'].iloc[:,4:]
(790, 23)
max depth of the station | sampling depth | Faecal enterococci (kpl/100ml) | Oxygen saturation (kyll.%) | Dissolved oxygen (mg/l) | Suspended solids, coarse (mg/l) | Chlorophyll a (µg/l) | Total phosphorous, unfiltered (µg/l) | Total nitrogen, unfiltered (µg/l) | Coliform bacteria thermotolerant (kpl/100ml) | Temperature (°C) | Nitrate as nitrogen, unfiltered (µg/l) | Nitrate as nitrogen, unfiltered (µg/l).1 | Nitrite nitrate as nitrogen, unfiltered (µg/l) | Secchi depth (m) | pH | Salinity () | Turbidity (FNU) | Conductivity (mS/m) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||
1987-07-13 12:00:00 | 2.4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN |
1987-07-13 12:00:00 | 2.4 | 0,0-1,2 | NaN | NaN | NaN | NaN | 4.4 | 12.0 | 650.0 | NaN | 16.6 | 330.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN |
1987-07-13 12:00:00 | 2.4 | 0,0-1,5 | NaN | NaN | NaN | NaN | 3.2 | 11.0 | 720.0 | NaN | 16.3 | 410.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN |
1987-07-27 12:00:00 | 2.4 | 0,0-2,0 | NaN | NaN | NaN | NaN | 1.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
WD.columns
Index(['name of the station', 'ID-number of the station',
'ETRS-coordinates east', 'ETRS-coordinates north',
'max depth of the station', 'sampling depth',
'Faecal enterococci (kpl/100ml)', 'Oxygen saturation (kyll.%)',
'Dissolved oxygen (mg/l)', 'Suspended solids, coarse (mg/l)',
'Chlorophyll a (µg/l)', 'Total phosphorous, unfiltered (µg/l)',
'Total nitrogen, unfiltered (µg/l)',
'Coliform bacteria thermotolerant (kpl/100ml)', 'Temperature (°C)',
'Nitrate as nitrogen, unfiltered (µg/l)',
'Nitrate as nitrogen, unfiltered (µg/l).1',
'Nitrite nitrate as nitrogen, unfiltered (µg/l)', 'Secchi depth (m)',
'pH ', 'Salinity ()', 'Turbidity (FNU)', 'Conductivity (mS/m)'],
dtype='object')
There seem to be a lot of NA values in the data. Further examination reveals that there are often two records for the same timestamp. The first record has only a couple of values and the rest are NAs, whereas the second record contains most of the other values, but the couple of values given in the first record are NAs. Obviously these successive rows need to be merged. This can be done "easily" by grouping the data by timestamp and using the first non-NA value of each column.
# This somewhat complex statement chains several sequential actions together:
# 1) group the dataframe WD by its index, called 'date'
# 2) apply the aggregate function first() to the grouped dataframe to replace the
#    value of each column in a group with the first non-NA value observed within the group
# 3) take the values observed in July 1987 from the corrected dataframe for further study
# 4) select all columns except the first four from the resulting dataframe
WD.groupby('date').first().loc['1987-07'].iloc[:,4:]
max depth of the station | sampling depth | Faecal enterococci (kpl/100ml) | Oxygen saturation (kyll.%) | Dissolved oxygen (mg/l) | Suspended solids, coarse (mg/l) | Chlorophyll a (µg/l) | Total phosphorous, unfiltered (µg/l) | Total nitrogen, unfiltered (µg/l) | Coliform bacteria thermotolerant (kpl/100ml) | Temperature (°C) | Nitrate as nitrogen, unfiltered (µg/l) | Nitrate as nitrogen, unfiltered (µg/l).1 | Nitrite nitrate as nitrogen, unfiltered (µg/l) | Secchi depth (m) | pH | Salinity () | Turbidity (FNU) | Conductivity (mS/m) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||
1987-07-13 12:00:00 | 2.4 | 0,0-1,2 | NaN | NaN | NaN | NaN | 4.4 | 12.0 | 650.0 | NaN | 16.6 | 330.0 | 2.0 | NaN | 2.0 | NaN | NaN | NaN | NaN |
1987-07-27 12:00:00 | 2.4 | 0,0-2,0 | NaN | NaN | NaN | NaN | 1.7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
There are still a lot of missing values, but the confusion about which value to select for a certain time is now gone.
Let's study the average temperature over the whole data.
# The Temperature column
i=14
# Plot individual observations as circles
ax=WD.iloc[:,i].plot(style='o')
ax.set_ylabel(r'Temperature [$^\circ$C]')
# Resample the data so that it has only one value every
# five years. The value is obtained by calculating the average
# or selecting the maximum value over each 5-year period.
# Then plot the resulting resampled data
WD.iloc[:,i].resample('5YE').mean().plot(linewidth=4, figsize=(12,8))
WD.iloc[:,i].resample('5YE').max().plot(linewidth=4)
# Save the image to the output directory in different formats
plt.savefig('output/Temperatureprofile.pdf')
plt.savefig('output/Temperatureprofile.png')
plt.savefig('output/Temperatureprofile.svg')
Study the general statistics of the data. Check, for example:
How many missing values there are in the chlorophyll observations
What a typical chlorophyll value is, and within what range it varies
WD.iloc[:,10:].describe()
Chlorophyll a (µg/l) | Total phosphorous, unfiltered (µg/l) | Total nitrogen, unfiltered (µg/l) | Coliform bacteria thermotolerant (kpl/100ml) | Temperature (°C) | Nitrate as nitrogen, unfiltered (µg/l) | Nitrate as nitrogen, unfiltered (µg/l).1 | Nitrite nitrate as nitrogen, unfiltered (µg/l) | Secchi depth (m) | pH | Salinity () | Turbidity (FNU) | Conductivity (mS/m) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 187.000000 | 266.000000 | 260.000000 | 114.000000 | 360.000000 | 30.000000 | 26.000000 | 75.000000 | 282.000000 | 258.000000 | 110.000000 | 223.000000 | 264.000000 |
mean | 7.655615 | 20.163534 | 1080.038462 | 11.263158 | 11.686667 | 304.433333 | 1.884615 | 456.253333 | 1.295390 | 7.104729 | 3.276545 | 5.635471 | 583.718182 |
std | 5.962969 | 18.260188 | 978.456944 | 22.860892 | 7.505963 | 597.447368 | 2.355027 | 697.891537 | 0.534733 | 0.883085 | 1.193004 | 6.581787 | 189.529933 |
min | 0.300000 | 2.000000 | 140.000000 | 0.000000 | -0.200000 | 2.000000 | 0.000000 | 2.000000 | 0.200000 | 4.300000 | 0.100000 | 0.460000 | 25.600000 |
25% | 3.700000 | 11.000000 | 470.000000 | 1.000000 | 2.850000 | 2.000000 | 0.000000 | 2.000000 | 0.900000 | 6.600000 | 2.425000 | 1.750000 | 477.500000 |
50% | 5.900000 | 15.500000 | 665.000000 | 3.000000 | 14.400000 | 43.500000 | 1.500000 | 51.000000 | 1.300000 | 7.300000 | 3.500000 | 3.200000 | 615.000000 |
75% | 10.500000 | 22.000000 | 1400.000000 | 11.000000 | 17.900000 | 367.500000 | 2.750000 | 680.000000 | 1.800000 | 7.800000 | 4.275000 | 6.400000 | 739.250000 |
max | 36.000000 | 218.000000 | 6700.000000 | 130.000000 | 24.000000 | 3100.000000 | 11.000000 | 2900.000000 | 2.500000 | 8.400000 | 5.100000 | 42.000000 | 890.000000 |
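The describe() table already answers these questions indirectly, but they can also be computed explicitly. A small sketch (the column name is copied from the WD.columns listing above):
chl=WD['Chlorophyll a (µg/l)']
print(chl.isna().sum()) # number of missing chlorophyll observations
print(chl.median()) # a typical value
print(chl.quantile([0.25, 0.75])) # the range containing the middle half of the values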