Common Data Structures in Python#
There are many different data structures used in Python. The most prominent ones are numpy arrays, pandas dataframes and dictionaries.
In this notebook, we will talk about these and how they are used within the Python language.
Numpy Arrays#
An array is a data structure we can use to store numerical information. Arrays can be n-dimensional, but typically they are 1- or 2-dimensional. 2-dimensional arrays are very often used to represent images (a grayscale image is just a grid of values between 0 (black) and 255 (white)). The typical structure of a 2-D array is something we call rows and columns.
This is a 1-dimensional array:
np.array([0,1,2,3]) (shape: (4,), i.e. 4 elements; data from 1 dog that evaluated her favorite snack)
This is a 2-dimensional array:
np.array([[0,1,2,3], (shape: (2 rows, 4 columns); data from 2 dogs that evaluated their favorite snacks)
[4,5,6,7]])
What you can see in this example is actually nothing more than calling the np.array() function and passing a list [0,1,2,3] to it.
Within a numpy array, the stored values must be homogeneous, meaning that they all need to belong to the same data type. Numpy arrays are optimized for numerical computations.
Numpy arrays also come with a fixed size!
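To see the homogeneity rule in action, here is a small sketch: when you pass mixed inputs, NumPy silently upcasts everything to one common dtype.

```python
import numpy as np

# Mixing integers and floats: everything is upcast to one common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64 -- the integers were promoted to floats

# Mixing numbers and strings: everything becomes a string
strings = np.array([1, "two", 3])
print(strings.dtype.kind)  # 'U' -- unicode strings
```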
We will now start to explore the numpy environment and the numpy arrays.
So first, start by importing numpy as np
import numpy as np
We will not work with real data yet, but rather simulate our own numpy arrays to work with. Numpy offers some really useful functions we can use to generate our arrays.
Let's start with the numpy.random.rand
function.
The key arguments we need to pass to the function are the sizes of the dimensions of the array we want to create. Let's start by creating a 1-dimensional array first.
The first argument of the numpy.random.rand
function determines the number of rows we want our array to have, whereas the second argument determines the number of columns.
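A quick sketch of how the dimension arguments translate into the resulting shape:

```python
import numpy as np

# Each positional argument is the size of one dimension
a = np.random.rand(2, 4)  # 2 rows, 4 columns
print(a.shape)            # (2, 4)

b = np.random.rand(5)     # a true 1-D array: one axis of length 5
print(b.shape)            # (5,)
```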
Exercise 15.0#
Create a 1-D numpy array using the rand
function from the random
module (from the numpy
package). Use it to create a numpy array with 1 Row and 20 Columns.
Assign your array to a variable called “RandomArray”.
RandomArray = np.random.rand(1,20)
The information about the shape
(e.g., how many rows and columns we have) is actually stored within the array object itself. We can access it with array.shape
RandomArray.shape
(1, 20)
We can also create arrays with zeros or ones
np.zeros((1,5))
np.ones((1,5))
Now in numpy, we can use multiple methods to easily extract information
RandomArray.min() #min value
RandomArray.max() #max value
RandomArray.argmin() #index of min value in that array
RandomArray.argmax() #index of max value in that array
Indexing a 1-dimensional numpy array works similar to slicing
in lists.
1D-Array[0] #element zero
1D-Array[0:10] #elements zero to 10 (10 excluded)
1D-Array[0:10:2] #every second element from zero to 10
Technically, our RandomArray is a 2-D matrix. To access the first 10 columns we need to index like this
SlicedArray = RandomArray[:,:10]
[: tells us that we want to index into all rows of the array
,:10] tells us that we want to index into the first 10 columns of the matrix
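A small self-contained example of this 2-D slicing, using np.arange instead of random data so the sliced values are predictable:

```python
import numpy as np

# Predictable values instead of random ones, so the slices are easy to check
grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

print(grid[:, :2])   # all rows, first two columns
print(grid[0, :])    # first row, all columns
print(grid[1:, 2:])  # rows 1 onwards, columns 2 onwards
```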
And we can also use the np.arange
function to create a numpy array with values in a given range
values = np.arange(0,10)
values, values.dtype
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), dtype('int32'))
Since SlicedArray and values have the same shape, we can perform any mathematical operation with them.
The operations are element-wise: the first element of array 1 is multiplied with the first element of array 2, and so on.
values * SlicedArray
array([[0. , 0.27416264, 0.65348972, 2.38691933, 0.05090751,
4.33516312, 3.4804216 , 6.213112 , 3.76518539, 7.90406815]])
We can also use logicals to compare and access numpy arrays
values > SlicedArray
array([[False, True, True, True, True, True, True, True, True,
True]])
This returns an array of boolean
values, where the condition is either True
or False
.
We can use this output of boolean values as a mask
. Masks are basically an accelerated version of the "if element meets the condition, append it to another list" pattern we practised before. The loop version below performs the same comparison element by element in plain Python, which is much slower.
mask = values > SlicedArray
values.reshape(1,10)[mask]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
for idx, element in enumerate(values):
    if element > SlicedArray[0][idx]:
        print(element)
    else:
        continue
1
2
3
4
5
6
7
8
9
With the np.random.choice
function, we can pass a numpy array and get back a random set of elements from that array. We can determine the number of draws with the size
parameter.
But what’s really cool is what happens behind the scenes.
Even though you’re writing Python code, a lot of NumPy’s operations, like random.choice, are actually powered by code written in C. That’s because Python is a high-level, interpreted language, which means it’s very readable and flexible, but not the fastest when it comes to numerical operations or looping over large datasets.
On the other hand, C is a low-level, compiled language, which means it runs much faster. So to get the best of both worlds — Python’s simplicity and C’s speed — many libraries like NumPy are written in C or Cython (a Python-like language that compiles to C) under the hood, and then “wrapped” in Python. This way, you write code that looks and feels like Python, but it’s executed at C speed behind the scenes.
menu = np.array(["Espresso", "Latte", "Cappuccino", "Americano", "Mocha"])
orders = np.random.choice(menu, size=50)
Under the hood, np.random.choice
is using a C-based loop. Even though vanilla Python is slower than numpy, we can still show that the logic of our for loops applies here!
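We can sketch this speed difference ourselves. The snippet below times the vectorized comparison against an equivalent plain-Python loop; the exact timings will vary from machine to machine, but the results are identical.

```python
import time
import numpy as np

data = np.random.rand(1_000_000)

# Vectorized comparison: executed by NumPy's compiled C loop
t0 = time.perf_counter()
vectorized = data > 0.5
t_numpy = time.perf_counter() - t0

# The same logic as an explicit Python loop
t0 = time.perf_counter()
looped = np.array([x > 0.5 for x in data])
t_loop = time.perf_counter() - t0

print(f"NumPy: {t_numpy:.4f}s, loop: {t_loop:.4f}s")
print((vectorized == looped).all())  # identical results, very different speed
```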
Exercise#
Create a function called random_choice. The goal of the function is to return a similar output as np.random.choice
. Since this function uses a loop under the hood, integrate a loop within your function as well. The output should be a numpy array. Try to use the same arguments as in np.random.choice
(an array as its input and the size parameter, which defines the number of iterations).
def random_choice(arr, num_iterations):
    choice = []
    for i in range(num_iterations):
        ran = np.random.randint(0, len(arr))  # random index into the array
        choice.append(arr[ran])
    return np.array(choice)
Exercise#
Using the np.arange
function, create one numpy array with values from 1 to 512 in steps of 2 and call the array all_scores.
Create another numpy array with values from 1 to 257 in steps of 1 and call it high_scores.
Your task is to find out, which values from high_scores
are actually in all_scores
.
Create a mask
of boolean values, which should be True if an element of high_scores is in all_scores, and False otherwise. Use the np.isin(firstarray,secondarray)
function to create the mask.
Which array should you pass as the first argument, and which as the second? Use np.isin?
to find out!
Use this mask to create a new array called valid_high_scores
. How do these arrays differ in their shape and distribution (mean + standard deviation)?
all_scores = np.arange(1,512,2)
high_scores = np.arange(1,257)
mask = np.isin(high_scores,all_scores)
valid_high_scores = high_scores[mask]
Exercise#
Create a numpy array of 9 numbers using np.arange
. Reshape it to a 2D matrix of the shape (3,3).
Hint: Use the .reshape()
method of the array.
np.arange(9).reshape(3,3)
Exercise#
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
What element would you expect to see if we index into the array using array[2,3]?
Answer: 12 (Row:3, Column:4)
Exercise#
Create a numpy array with random integers between 1 and 255, with the shape 64,64.
Hint: Use the np.random.randint
function
np.random.randint(1,255,size=(64,64))
Pandas#
So pandas is the Python library you want to use for organizing, manipulating and analyzing your datasets. The standard data structure used in pandas is the dataframe.
Pandas dataframes are basically like an excel sheet, but way, way better.
In this section, we will download a dataset and use this to explore dataframes and apply what we have learned so far.
But first the basic import:
import pandas as pd
This is the way to go. Again, you can use whatever abbreviation you want, but I don't think I have ever seen code where someone just used pandas.xyz
or pandas as p
or something strange like that.
import pandas as pd
Before transitioning to the dataset, you should know a thing or two about pandas
.
The cool thing here is that you can actually convert lists
or numpy arrays
to a dataframe
.
What you usually want to do is put your lists or arrays into a dictionary
. A dictionary is another vanilla Python datatype. Its syntax goes like this:
dictionary = {"Participant Number":[0,1,2,3,4],
"Reaction times":[100,50,76,34,95]}
The string input is what we call a key
(Participant Number, Reaction times). The key basically stores the values that are associated with it. We won't focus on dictionaries too much here, but you should know that a dictionary can store values (or lists of values, or arrays) under so-called keys.
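For example, the values stored under a key can be retrieved by indexing the dictionary with that key:

```python
# The dictionary from above: keys map to the values stored under them
dictionary = {"Participant Number": [0, 1, 2, 3, 4],
              "Reaction times": [100, 50, 76, 34, 95]}

print(dictionary["Reaction times"])     # the whole list stored under that key
print(dictionary["Reaction times"][0])  # a single element: 100
print(list(dictionary.keys()))          # all keys of the dictionary
```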
What we can now do is create a dataframe
from this dictionary.
'''This code cell creates a dictionary called coffee_order_dictionary. It has four keys: Customer_ID, Drink, Size and Time_to_prepare_sec.
The Customer_ID values are created using the np.arange function, which gives values from start (inclusive) to stop (exclusive). We randomly create the Drink and Size values using the np.random.choice function and the preparation times using np.random.randint.
'''
coffee_order_dictionary = {
"Customer_ID": np.arange(1, 51),
"Drink": np.random.choice(
["Latte", "Espresso", "Cappuccino", "Americano", "Mocha"], size=50),
"Size": np.random.choice(["Small", "Medium", "Large"], size=50), #np.random.choice picks one element at a time at random
"Time_to_prepare_sec": np.random.randint(60, 300, size=50)
}
dataframe = pd.DataFrame(coffee_order_dictionary)
So the expected shape of our dataframe should look like this!
We can use the dataframe.head(n=n)
method to display the first n-entries of our dataframe.
dataframe.head(n=10)
|   | Customer_ID | Drink      | Size   | Time_to_prepare_sec |
|---|-------------|------------|--------|---------------------|
| 0 | 1           | Espresso   | Small  | 92                  |
| 1 | 2           | Americano  | Small  | 89                  |
| 2 | 3           | Cappuccino | Large  | 198                 |
| 3 | 4           | Cappuccino | Large  | 223                 |
| 4 | 5           | Cappuccino | Large  | 286                 |
| 5 | 6           | Cappuccino | Large  | 102                 |
| 6 | 7           | Latte      | Small  | 100                 |
| 7 | 8           | Latte      | Medium | 271                 |
| 8 | 9           | Mocha      | Small  | 282                 |
| 9 | 10          | Latte      | Small  | 257                 |
You can again use the ?
here to gather more information about your dataframe.
dataframe?
Type: DataFrame
String form:
Customer_ID Drink Size Time_to_prepare_sec
0 1 Espresso Small <...> Mocha Small 282
9 10 Latte Small 257
Length: 10
File: c:\users\janos\anaconda3\lib\site-packages\pandas\core\frame.py
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If
data is a dict, column order follows insertion-order. If a dict contains Series
which have an index defined, it is aligned by its index.
.. versionchanged:: 0.25.0
If data is a list of dicts, column order follows insertion-order.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if
no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame when data does not have them,
defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
will perform column selection instead.
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
Copy data from inputs.
For dict data, the default of None behaves like ``copy=True``. For DataFrame
or 2d ndarray input, the default of None behaves like ``copy=False``.
.. versionchanged:: 1.3.0
See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.
Examples
--------
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1 int64
col2 int64
dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1 int8
col2 int8
dtype: object
Constructing DataFrame from a dictionary including Series:
>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
col1 col2
0 0 NaN
1 1 NaN
2 2 2.0
3 3 3.0
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
... columns=['a', 'b', 'c'])
>>> df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Constructing DataFrame from a numpy ndarray that has labeled columns:
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
c a
0 3 1
1 6 4
2 9 7
Constructing DataFrame from dataclass:
>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
x y
0 0 0
1 0 3
2 2 3
You can see that this object is neatly organized. It has four columns, which correspond to the keys
from our coffee_order_dictionary
. With this dataframe, we now have the opportunity to do many different things. But that would be pretty boring based on this dataframe. So we will load a different one in and check out pandas functionalities based on it.
The code we are using to obtain this data is not Python
but Bash
. The -O flag tells curl to save the file under its remote name.
!curl -O https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Stadt_Koeln_Statistischer_Datenkatalog.csv
We have now temporarily downloaded the Stadt_Koeln_Statistischer_Datenkatalog.csv
file in our Google Colab session.
This also means that we can now load it into a pandas dataframe. The function we want to use for that is called
pd.read_csv(yourfilename)
We use this function to read the Stadt_Koeln_Statistischer_Datenkatalog.csv
file and store it in a dataframe called koeln_stats
.
Sometimes, when reading in a .csv file, we need to pass the sep
argument. This prevents all columns from ending up in a single one, which would render the dataframe useless. The sep
argument tells pandas which separator to split the columns on.
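A minimal sketch of what goes wrong without sep, using a tiny made-up semicolon-separated file read from a string:

```python
from io import StringIO
import pandas as pd

# A tiny made-up semicolon-separated file, read from a string
raw = "name;score\nAlice;10\nBob;7\n"

wrong = pd.read_csv(StringIO(raw))           # default sep="," -> everything in one column
right = pd.read_csv(StringIO(raw), sep=";")  # columns are split correctly

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 2)
```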
koeln_stats = pd.read_csv(
"https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Stadt_Koeln_Statistischer_Datenkatalog.csv",
sep=";")
C:\Users\janos\AppData\Local\Temp\ipykernel_5584\989657012.py:1: DtypeWarning: Columns (9,11,13,15,17,19,23,24,26,28,30,31,33,34,38,42,43,45,47,49,50,52,53,56,62,64,66,67,69,72,77,78,79,82,87,89,91,98,101,102,105,106,107,108,110,113,115,118,119,123,124,125,126,127,128,129,136,137,138,139,140,141,143,145,146,158,159,160,161,167,168,169,170,171,173,174) have mixed types. Specify dtype option on import or set low_memory=False.
koeln_stats = pd.read_csv(
Because the column descriptions of the koeln_stats
dataframe are not informative at all, we also need to download the description file and read it into a pandas dataframe called koeln_stats_description
. This will be useful later on.
koeln_stats_description = pd.read_csv(
"https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Beschreibung_Statistischer_Datenkatalog.csv",
sep=";")
We can display our newly obtained dataframe by simply typing and running koeln_stats in a code cell. If you are only interested in viewing the first or last n elements of your dataframe, you can use df.head(n=n)
or df.tail(n=n)
, respectively.
koeln_stats.head(n=10)
S_JAHR | S_RAUM | RAUM | S_RAUMEBENE | RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 0 | 0 / Stadt Köln | 0 | Gesamtstadt | 180415.0 | 17,271948 | 41,90013762 | 1044555.0 | 46426 | ... | 87,059374 | 483,255 | 5944 | 2941 | 3114 | 39,40239241 | 75,57050842 | 544630.0 | 7,522905 | 40972 |
1 | 2012 | 1 | 1 / Innenstadt | 1 | Stadtbezirke | 21712.0 | 16,985457 | 40,86903262 | 127827.0 | 4428 | ... | 93,269732 | 458,447 | 566 | 296 | 193 | 40,39537813 | 63,87377692 | 80841.0 | 2,508628 | 2028 |
2 | 2012 | 2 | 2 / Rodenkirchen | 1 | Stadtbezirke | 14788.0 | 14,337793 | 43,45253054 | 103140.0 | 5331 | ... | 91,860767 | 569,468 | 1187 | 450 | 348 | 44,27280396 | 85,89885062 | 53159.0 | 3,397355 | 1806 |
3 | 2012 | 3 | 3 / Lindenthal | 1 | Stadtbezirke | 14132.0 | 9,872231 | 42,06031943 | 143149.0 | 6787 | ... | 91,920257 | 525,04 | 1172 | 689 | 848 | 45,80771085 | 82,08048668 | 79889.0 | 1,126563 | 900 |
4 | 2012 | 4 | 4 / Ehrenfeld | 1 | Stadtbezirke | 19811.0 | 18,779445 | 40,54831047 | 105493.0 | 3935 | ... | 83,775698 | 449,053 | 439 | 365 | 293 | 36,15031329 | 69,38747475 | 54961.0 | 12,57437 | 6911 |
5 | 2012 | 5 | 5 / Nippes | 1 | Stadtbezirke | 20676.0 | 18,145597 | 42,27830898 | 113945.0 | 5154 | ... | 84,574078 | 491,482 | 214 | 52 | 114 | 37,56269253 | 71,26460647 | 60059.0 | 7,341114 | 4409 |
6 | 2012 | 6 | 6 / Chorweiler | 1 | Stadtbezirke | 14892.0 | 18,409049 | 42,06668212 | 80895.0 | 3493 | ... | 84,421093 | 499,165 | 319 | 113 | 189 | 37,27096854 | 87,33156645 | 34524.0 | 23,473525 | 8104 |
7 | 2012 | 7 | 7 / Porz | 1 | Stadtbezirke | 16613.0 | 15,235833 | 43,36416022 | 109039.0 | 5084 | ... | 84,779095 | 553,792 | 453 | 147 | 316 | 39,70796687 | 82,05816466 | 52764.0 | 5,069744 | 2675 |
8 | 2012 | 8 | 8 / Kalk | 1 | Stadtbezirke | 29259.0 | 25,468077 | 40,73158448 | 114885.0 | 5163 | ... | 82,43595 | 377,211 | 915 | 441 | 417 | 35,27169778 | 73,36004852 | 55237.0 | 12,618353 | 6970 |
9 | 2012 | 9 | 9 / Mülheim | 1 | Stadtbezirke | 28532.0 | 19,518135 | 41,96456415 | 146182.0 | 7051 | ... | 81,935198 | 420,681 | 679 | 388 | 396 | 36,80407985 | 73,50256845 | 73196.0 | 9,794251 | 7169 |
10 rows × 175 columns
koeln_stats.tail(n=10)
S_JAHR | S_RAUM | RAUM | S_RAUMEBENE | RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8198 | 2023 | 907030001 | 907030001 / Siedlung Klosterhof | 3 | Statistische Quartiere | 557.0 | 21,514098 | 41,49510751 | 2589.0 | 158 | ... | 91,377 | 442,255 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 50 |
8199 | 2023 | 907060001 | 907060001 / Siedlung Am Donewald | 3 | Statistische Quartiere | 568.0 | 27,559437 | 38,91557496 | 2061.0 | 90 | ... | 94,60204 | 361,475 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 571 |
8200 | 2023 | 908010001 | 908010001 / Stammheim-Nord | 3 | Statistische Quartiere | 337.0 | 21,997389 | 40,79868364 | 1532.0 | 89 | ... | 101,85955 | 1124,02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
8201 | 2023 | 908020001 | 908020001 / Stammheim-Süd - Adolf-Kober-Str. | 3 | Statistische Quartiere | 609.0 | 24,477492 | 43,19138532 | 2488.0 | 220 | ... | 95,658613 | 430,466 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 41 |
8202 | 2023 | 908020002 | 908020002 / Stammheim-Süd - Ricarda-Huch-Str. | 3 | Statistische Quartiere | 399.0 | 26,181102 | 43,38019466 | 1524.0 | 116 | ... | 88,834381 | 348,425 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 199 |
8203 | 2023 | 908030001 | 908030001 / Stammheim | 3 | Statistische Quartiere | 367.0 | 14,864318 | 46,16612664 | 2469.0 | 196 | ... | 92,187551 | 574,726 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14 |
8204 | 2023 | 909010001 | 909010001 / Flittard | 3 | Statistische Quartiere | 309.0 | 12,112897 | 45,61613093 | 2551.0 | 186 | ... | 93,352664 | 589,964 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14 |
8205 | 2023 | 909030001 | 909030001 / Bayer-Siedlung - Rungestr. | 3 | Statistische Quartiere | 285.0 | 21,348315 | 43,81292135 | 1335.0 | 86 | ... | 93,488372 | 496,629 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
8206 | 2023 | 909030002 | 909030002 / Bayer-Siedlung - Roggendorfstr. | 3 | Statistische Quartiere | 506.0 | 21,754084 | 44,40745916 | 2326.0 | 267 | ... | 96,421722 | 490,541 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12 |
8207 | 2023 | 909030003 | 909030003 / Bayer-Siedlung - Hufelandstr. | 3 | Statistische Quartiere | 384.0 | 20,210526 | 40,03061404 | 1900.0 | 89 | ... | 97,498829 | 512,631 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
10 rows × 175 columns
If we take a closer look at a column like A0002P
, we actually see that the values look like this:
17,271948
This is very problematic, because Python expects a .
as the decimal separator. A value containing a comma will be interpreted as a string. Problems like this can always happen, especially when you deal with uncleaned data.
That's why we first need to define a function that replaces the ,
with a .
and converts the result to a float. All values that cannot be converted will become NaN (Not a Number)
.
def to_german_float(val):
    try:
        return float(str(val).replace(",", "."))
    except (ValueError, TypeError):
        return np.nan

# Apply to all object-type columns (skipping the first two text columns)
for col in koeln_stats.select_dtypes(include='object').columns[2:]:
    koeln_stats[col] = koeln_stats[col].apply(to_german_float)
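A quick sanity check of the conversion helper (restated here so the snippet runs on its own):

```python
import numpy as np

# Restated here so the snippet is self-contained
def to_german_float(val):
    try:
        return float(str(val).replace(",", "."))
    except (ValueError, TypeError):
        return np.nan

print(to_german_float("17,271948"))   # 17.271948
print(to_german_float("Innenstadt"))  # nan -- non-numeric text becomes NaN
```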
koeln_stats
S_JAHR | S_RAUM | RAUM | S_RAUMEBENE | RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 0 | 0 / Stadt Köln | 0 | Gesamtstadt | 180415.0 | 17.271948 | 41.900138 | 1044555.0 | 46426.0 | ... | 87.059374 | 483.255 | 5944.0 | 2941.0 | 3114.0 | 39.402392 | 75.570508 | 544630.0 | 7.522905 | 40972.0 |
1 | 2012 | 1 | 1 / Innenstadt | 1 | Stadtbezirke | 21712.0 | 16.985457 | 40.869033 | 127827.0 | 4428.0 | ... | 93.269732 | 458.447 | 566.0 | 296.0 | 193.0 | 40.395378 | 63.873777 | 80841.0 | 2.508628 | 2028.0 |
2 | 2012 | 2 | 2 / Rodenkirchen | 1 | Stadtbezirke | 14788.0 | 14.337793 | 43.452531 | 103140.0 | 5331.0 | ... | 91.860767 | 569.468 | 1187.0 | 450.0 | 348.0 | 44.272804 | 85.898851 | 53159.0 | 3.397355 | 1806.0 |
3 | 2012 | 3 | 3 / Lindenthal | 1 | Stadtbezirke | 14132.0 | 9.872231 | 42.060319 | 143149.0 | 6787.0 | ... | 91.920257 | 525.040 | 1172.0 | 689.0 | 848.0 | 45.807711 | 82.080487 | 79889.0 | 1.126563 | 900.0 |
4 | 2012 | 4 | 4 / Ehrenfeld | 1 | Stadtbezirke | 19811.0 | 18.779445 | 40.548310 | 105493.0 | 3935.0 | ... | 83.775698 | 449.053 | 439.0 | 365.0 | 293.0 | 36.150313 | 69.387475 | 54961.0 | 12.574370 | 6911.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8203 | 2023 | 908030001 | 908030001 / Stammheim | 3 | Statistische Quartiere | 367.0 | 14.864318 | 46.166127 | 2469.0 | 196.0 | ... | 92.187551 | 574.726 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14.0 |
8204 | 2023 | 909010001 | 909010001 / Flittard | 3 | Statistische Quartiere | 309.0 | 12.112897 | 45.616131 | 2551.0 | 186.0 | ... | 93.352664 | 589.964 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14.0 |
8205 | 2023 | 909030001 | 909030001 / Bayer-Siedlung - Rungestr. | 3 | Statistische Quartiere | 285.0 | 21.348315 | 43.812921 | 1335.0 | 86.0 | ... | 93.488372 | 496.629 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
8206 | 2023 | 909030002 | 909030002 / Bayer-Siedlung - Roggendorfstr. | 3 | Statistische Quartiere | 506.0 | 21.754084 | 44.407459 | 2326.0 | 267.0 | ... | 96.421722 | 490.541 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 |
8207 | 2023 | 909030003 | 909030003 / Bayer-Siedlung - Hufelandstr. | 3 | Statistische Quartiere | 384.0 | 20.210526 | 40.030614 | 1900.0 | 89.0 | ... | 97.498829 | 512.631 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
8208 rows × 175 columns
One last thing we need to do:
Some numbers in that dataframe are actually quite large. Pandas tends to use scientific notation to shorten the output, but this makes it hard to interpret at times. So let's change that!
pd.set_option('display.float_format', '{:,.2f}'.format)
This won't influence the to_german_float
function we used, as set_option
only changes how the values are printed, not how they are stored or computed.
As you can see here, a dataframe looks strikingly similar to what you would expect in an Excel sheet. It has a bunch of rows and columns, and each column stores some information.
We can further examine the type and shape (rows and columns) of our dataframe by using the df.info
method.
koeln_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Columns: 175 entries, S_JAHR to B0026A
dtypes: float64(166), int64(7), object(2)
memory usage: 11.0+ MB
This method gives us information about the number of columns, the datatype (Dtype) of each column and how many entries (rows) there are. This is super helpful to get an idea of what kind of data we are dealing with.
However, the returned datatypes (dtypes: float64(166), int64(7), object(2)
) refer to the values
within each column. object
= mixed information (strings and integers, for example)
The datatype of a column itself is a pandas.Series
.
print(f"the datatype of the column RAUM is {type(koeln_stats.RAUM)}")
the datatype of the column RAUM is <class 'pandas.core.series.Series'>
To access a column of the dataframe, we have two options.
The first one is df.column
(replace df with the name of your own dataframe!). However, this only works if your column name has no spaces or special characters!
So we could use koeln_stats.RAUM
but we can also use koeln_stats["RAUM"]
. The output of these operations is equivalent!
koeln_stats.RAUM.head(n=5)
koeln_stats["RAUM"].head(n=5)
We can separately extract the column names with df.columns
.
koeln_stats.columns
Index(['S_JAHR', 'S_RAUM', 'RAUM', 'S_RAUMEBENE', 'RAUMEBENE', 'A0002A',
'A0002P', 'A0022S', 'A0025A', 'A0027A',
...
'H0051S', 'H0052S', 'B0003A', 'B0004A', 'B0009A', 'B0022S', 'B0023S',
'B0025A', 'B0026P', 'B0026A'],
dtype='object', length=175)
We can also directly convert them to a list
by calling the to_list()
method!
columns_list = koeln_stats.columns.to_list()
type(columns_list)
list
And in principle, we can now use this list to get a subset of our dataframe, extracting only the first two columns.
koeln_stats[columns_list[:2]]
S_JAHR | S_RAUM | |
---|---|---|
0 | 2012 | 0 |
1 | 2012 | 1 |
2 | 2012 | 2 |
3 | 2012 | 3 |
4 | 2012 | 4 |
... | ... | ... |
8203 | 2023 | 908030001 |
8204 | 2023 | 909010001 |
8205 | 2023 | 909030001 |
8206 | 2023 | 909030002 |
8207 | 2023 | 909030003 |
8208 rows × 2 columns
df.describe() can be used to get a numerical overview of all values in our dataframe
koeln_stats.describe()
S_JAHR | S_RAUM | S_RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | A0027P | A0029A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8208.000000 | 8.208000e+03 | 8208.000000 | 8196.000000 | 8196.000000 | 8196.000000 | 8.196000e+03 | 8171.000000 | 8168.000000 | 8193.000000 | ... | 8208.000000 | 8208.000000 | 1291.000000 | 1207.000000 | 1139.000000 | 1152.000000 | 1152.000000 | 1152.000000 | 1152.000000 | 8142.000000 |
mean | 2017.500000 | 4.189774e+08 | 2.869883 | 1628.660200 | 18.953883 | 42.038425 | 8.318157e+03 | 428.981642 | 5.217621 | 246.199194 | ... | 92.323871 | 507.989197 | 320.107668 | 147.967688 | 137.525900 | 40.370575 | 82.584000 | 17469.364583 | 7.885326 | 316.908745 |
std | 3.452263 | 3.057064e+08 | 0.448253 | 10005.266938 | 10.583494 | 3.765539 | 5.364302e+04 | 2808.928659 | 2.898831 | 1571.445516 | ... | 9.874286 | 274.522550 | 1014.343728 | 461.114190 | 426.184214 | 7.641356 | 17.648600 | 58296.342893 | 10.581390 | 1923.361828 |
min | 2012.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | ... | 61.800000 | 0.000000 | -52.000000 | -53.000000 | -7.000000 | 24.738025 | 59.646470 | 477.000000 | 0.000000 | 0.000000 |
25% | 2014.750000 | 1.050100e+08 | 3.000000 | 213.750000 | 11.275600 | 39.719668 | 1.557000e+03 | 62.500000 | 3.409433 | 42.000000 | ... | 85.754317 | 363.451500 | 24.000000 | 9.000000 | 7.000000 | 36.579911 | 71.708031 | 3192.750000 | 1.761331 | 0.000000 |
50% | 2017.500000 | 4.010400e+08 | 3.000000 | 343.000000 | 16.527840 | 42.123683 | 1.985500e+03 | 99.000000 | 4.817014 | 57.000000 | ... | 90.675546 | 458.891500 | 71.000000 | 26.000000 | 25.000000 | 39.630610 | 81.366647 | 5510.000000 | 4.822360 | 30.000000 |
75% | 2020.250000 | 7.062975e+08 | 3.000000 | 636.000000 | 23.971053 | 44.426587 | 2.547000e+03 | 164.000000 | 6.417112 | 82.000000 | ... | 96.492498 | 578.956250 | 217.000000 | 98.000000 | 94.000000 | 44.327964 | 90.098877 | 10561.500000 | 9.593492 | 137.000000 |
max | 2023.000000 | 9.090300e+08 | 4.000000 | 228555.000000 | 82.587783 | 71.926288 | 1.095520e+06 | 64063.000000 | 48.620911 | 34061.000000 | ... | 149.796886 | 4307.404000 | 9912.000000 | 4689.000000 | 3957.000000 | 92.464392 | 205.228320 | 572090.000000 | 89.996014 | 40972.000000 |
8 rows × 173 columns
Indexing and masking in pandas dataframes works a bit like in a numpy array. We can also extract multiple columns at once
koeln_stats[["S_RAUM","S_JAHR","S_RAUMEBENE"]]
This is equivalent to the snippet below, since it's just lists after all!
column_lists = ["S_RAUM","S_JAHR","S_RAUMEBENE"]
koeln_stats[column_lists]
We can also use df.loc[rows,columns]
to index into our dataframe. .loc
is used to index into the dataframe based on labels. Since row labels are usually just numbers, we can pass an integer here.
To get the column, we need to pass the column name as a string.
koeln_stats.loc[0,"A0002P"]
17.271948
Exercise#
Show every second row of the first 16 rows of the koeln_stats
dataframe for the column S_RAUM
.
Hint: Use .loc
indexing. You can pass an integer to the row selection and “S_RAUM” to the column selection df.loc[row,column]
. Use slicing to get every 2nd row.
The general syntax for slicing is
[start:stop:step] -> [index where the slicing starts : index where the slicing stops : Interval between slices]
koeln_stats.loc[:16:2,"S_RAUM"]
To use integers for indexing, we need to use df.iloc
. Now, we can simply pass integer values for both rows and columns.
df.iloc[0,1] -> First row and second column of the dataframe
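A minimal side-by-side of .iloc and .loc on a tiny made-up dataframe:

```python
import pandas as pd

# A tiny made-up dataframe, purely for illustration
df = pd.DataFrame({"a": [10, 20, 30], "b": [40, 50, 60]})

print(df.iloc[0, 1])    # position based: row 0, column 1 -> 40
print(df.loc[0, "b"])   # label based: row label 0, column label "b" -> 40

# Positional slices are end-exclusive, just like list slicing
print(df.iloc[0:2, 0].tolist())  # [10, 20]
```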
Exercise#
Define a random integer called ran_col
using the numpy.random.randint
function. It should not be larger than the number of columns in koeln_stats
.
Define a second random integer called ran_rows
using the same function. Make sure it's not larger than the number of rows in koeln_stats
.
Use these two integers to index into the dataframe using .iloc
# Define random column and row indices
ran_col = np.random.randint(0, koeln_stats.shape[1]) # Random column index (0 to number of columns - 1)
ran_row = np.random.randint(0, koeln_stats.shape[0]) # Random row index (0 to number of rows - 1)
# Use .iloc to index into the dataframe
random_value = koeln_stats.iloc[ran_row, ran_col]
We can also create a mask that is based on boolean values and use it to extract parts of the dataframe, where a given condition is True
.
The column
S_JAHR
stores the year in which each statistic was collected.
mask = koeln_stats["S_JAHR"] == 2012
koeln_stats[mask]
This is equivalent to
koeln_stats[koeln_stats.S_JAHR == 2012]
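A minimal sketch of boolean masking, using a toy dataframe in place of koeln_stats:

```python
import pandas as pd

# Toy dataframe reusing the S_JAHR column name from the tutorial
stats = pd.DataFrame({"S_JAHR": [2011, 2012, 2012, 2013],
                      "value":  [1.0, 2.0, 3.0, 4.0]})

mask = stats["S_JAHR"] == 2012  # boolean Series: False, True, True, False
subset = stats[mask]            # keeps only the rows where the mask is True
print(subset)
```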
Exercise#
Create a new variable called my_vedel
. Assign it a string with the name of the vedel you live in (if you are comfortable with that; otherwise just use any other).
We now want to extract the data that belongs to my_vedel
.
Since this information is stored in the RAUM
column, we need to build the mask based on that column.
Unfortunately, a simple equality comparison will not match here, because the vedel name may appear only as part of a longer string. So we need to use df.column.str.contains(str).
Create a variable called mask
. Use the syntax
df.column.str.contains(str)
from above.
Hint: Replace df
with the actual name of the dataframe,
column
with the RAUM column, and
str
with the name of your vedel (as a string variable!)
Create a new variable named after your vedel. This variable should contain only the rows for your vedel!
For example, you want something like
mask = ....
ehrenfeld = koeln_stats[mask]
my_vedel = "Ehrenfeld"
koeln_stats[koeln_stats.RAUM.str.contains(my_vedel)]
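A minimal sketch of substring masking with str.contains, using a toy dataframe with made-up district entries:

```python
import pandas as pd

# Toy dataframe with a RAUM-like string column (entries are invented)
stats = pd.DataFrame({"RAUM":  ["Ehrenfeld", "Nippes", "Köln-Ehrenfeld"],
                      "value": [1, 2, 3]})

my_vedel = "Ehrenfeld"
mask = stats.RAUM.str.contains(my_vedel)  # True wherever the substring occurs
ehrenfeld = stats[mask]                   # matches plain and compound names
print(ehrenfeld)
```

Note that the substring match picks up both "Ehrenfeld" and "Köln-Ehrenfeld", which is exactly why str.contains is preferred over a strict equality comparison here.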
Exercise#
Using the koeln_stats_description
dataframe, look up a column code that you think is interesting. Access that column in your vedel
or the koeln_stats
dataframe.
Hint: Use koeln_stats_description.head(n=50) to see more column names stored in there.
Hint 2.0: Use this code
koeln_stats_description[["SCHLUESSEL","INHALT"]].head(n=50).iloc[num_row,1]
to get specific information about a given row!
filtered = koeln_stats[["S_JAHR","S_RAUM","RAUM","S_RAUMEBENE","RAUMEBENE","A0275A","A0315A"]]
filtered.groupby(["RAUM","S_JAHR"])["A0275A"].mean()
With that column in mind you can do some interesting investigation. Maybe you want to know whether the number of people above 80 years differs between vedels?
To get an idea of the mean distribution
you can use the following syntax
df.groupby("ColumnToGroupBy")["OutcomeColumn"].mean()
Try this now with your dataframe.
Hint: If you want to group by multiple columns, you must pass them as a list to groupby.
df.groupby(["Column1","Column2"])["Outcome"].mean()
filtered.groupby(["RAUM","S_JAHR"])["A0275A"].mean()
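To make the groupby pattern concrete, here is a minimal sketch with a toy dataframe that mimics the column structure (the values are invented):

```python
import pandas as pd

# Toy data mimicking the RAUM / S_JAHR / A0275A structure
df = pd.DataFrame({
    "RAUM":   ["A", "A", "B", "B"],
    "S_JAHR": [2012, 2012, 2012, 2013],
    "A0275A": [10.0, 20.0, 30.0, 40.0],
})

# Mean of the outcome column per (RAUM, S_JAHR) group
means = df.groupby(["RAUM", "S_JAHR"])["A0275A"].mean()
print(means)
```

The result is a Series indexed by (RAUM, S_JAHR) pairs, so each group's mean can be looked up directly.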
We can also compute distribution statistics (mean, standard deviation) directly from a column
koeln_stats["A0002P"].mean(),koeln_stats["A0002P"].std()
(18.953883179111735, 10.583493602915325)
Data Visualization and Plotting#
Now that we’ve taken a closer look at our data through basic descriptive statistics and data types, we’ll take the next step by exploring it visually. Basic data visualization offers a different perspective and can reveal key patterns or issues relevant for further analysis. In the following steps, we’ll use various Python libraries and functions to create visualizations that highlight different aspects of the data.
What is Plotting?
Plotting in data science and programming refers to the visual representation of data using charts or graphs. It helps us understand patterns, relationships, and trends in data more clearly and efficiently than raw numbers alone. By turning data into visual formats, such as line graphs, bar charts, histograms, or scatter plots, we can make more informed decisions, identify outliers, and communicate insights to others.
There are two very popular libraries in Python that are almost always used for visualizing data.
That is seaborn
and matplotlib
.
Matplotlib's interface is modeled after MATLAB's plotting commands, and it is the most widely used plotting library in Python.
Seaborn is built on top of matplotlib
and is probably the second most popular library for visualizing data in Python.
Once more, these libraries have their own commonly used abbreviations. You will usually import them like this
import matplotlib.pyplot as plt
import seaborn as sns
sns.displot(data=dataframe, x="Time_to_prepare_sec",hue="Size",kind="kde",multiple="stack")
We can also combine different plots into one figure
fig,axes=plt.subplots(1,3,sharey=True,sharex=True)
sns.lineplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",errorbar=None,ax=axes[0],palette="deep")
sns.barplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",errorbar=None,ax=axes[1],palette="deep")
sns.violinplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",ax=axes[2])
for ax in axes:
ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()
We can also create an interaction plot. This plot might be useful if we assume a difference between our independent variables. For example, does the size of the drink influence the time to prepare it?
We indicate this with the hue
parameter.
sns.lineplot(data=dataframe, y="Time_to_prepare_sec", x="Drink",hue="Size",errorbar=None,palette="deep")
plt.title("Time to Prepare Different Drinks by Size", fontsize=14)
plt.xlabel("Drink Type", fontsize=12)
plt.ylabel("Preparation Time (sec)", fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels
plt.grid(True, linestyle="--", alpha=0.5)
plt.tight_layout()
plt.legend(title="Drink Size")
There are, of course, many more plot types. You can find all of them in the seaborn example gallery.
Anyway, after this short tutorial on plotting and data visualization, let's return to our koeln_stats dataframe.
Exercise#
Visualize the column you picked from earlier, using the seaborn library. Add xlabels, ylabels and a title to your plot. If you want to, you can choose multiple columns and plot something interactive, using the hue
parameter.
For example, you might wonder whether the number of people above 80 years old increased over the years in a specific vedel.
sns.lineplot(data=filtered,x="S_JAHR",y="A0275A")
plt.title("Number of three-child households in Köln over the years", fontsize=14)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Number of households", fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels
plt.grid(True, linestyle="--", alpha=0.5)
plt.tight_layout()