Common Data Structures in Python#

There are many different data structures used in Python. The most prominently used are numpy arrays, pandas dataframes and dictionaries.

In this notebook, we will talk about these and how they are used within the Python language.


Numpy Arrays#

An array is a data structure we can use to store numerical information. Arrays can be n-dimensional, but typically they are 1- or 2-dimensional. 2-dimensional arrays are very often used to represent images (a grayscale image is just a grid of values between 0 (black) and 255 (white)). The typical structure of a 2-D array is described in terms of rows and columns.

      This is a 1-Dimensional Array
      np.array([0,1,2,3])                     (shape: (4,), i.e. 4 values; data from 1 dog that evaluated her favorite snacks)

      This is a 2-Dimensional Array
      np.array([[0,1,2,3],                    (shape: (2 rows, 4 columns); data from 2 dogs that evaluated their favorite snacks)
                [4,5,6,7]])

What you can see in this example is actually nothing more than calling the np.array() function and passing a list [0,1,2,3] (or a list of lists) to it.

Within a numpy array, the stored values must be homogeneous, meaning that they all need to belong to the same data type. Numpy arrays are optimized for numerical computations.

Numpy arrays come with a fixed size!
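
To make both points concrete, here is a small sketch (the variable names are just for illustration): mixing integers and a float upcasts everything to float, assigning a float into an integer array silently truncates it, and "appending" always builds a brand-new array.

      import numpy as np

      mixed = np.array([1, 2, 3.5])   # mixing ints and a float upcasts everything to float64
      mixed.dtype                     # dtype('float64')

      ints = np.array([1, 2, 3])
      ints[0] = 9.7                   # the float is truncated to fit the integer dtype
      ints                            # array([9, 2, 3])

      bigger = np.append(ints, 4)     # the size is fixed: np.append returns a brand-new array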

We will now start to explore the numpy environment and the numpy arrays.

So first, start by importing numpy as np

import numpy as np

We will not work with real data yet, but rather simulate our own numpy arrays to work with. Numpy offers some really useful functions we can use to generate our arrays.

Let's start with the numpy.random.rand function.

The key argument we need to pass to the function is the shape of the array we want to create. Let's start by creating a 1-dimensional array first.

The first argument of the numpy.random.rand function determines the number of rows we want our array to have, whereas the second argument determines the number of columns.

Exercise 15.0#

Create a 1-D numpy array using the rand function from the random module (from the numpy package). Use it to create a numpy array with 1 Row and 20 Columns.

Assign your array to a variable called “RandomArray”.

RandomArray = np.random.rand(1,20)

The information about the shape (i.e., how many rows and columns we have) is actually stored within the array object itself. We can access it with array.shape

RandomArray.shape
(1, 20)

We can also create arrays with zeros or ones

np.zeros((1,5))
np.ones((1,5))

Now in numpy, we can use multiple methods to easily extract information

RandomArray.min() #min value

RandomArray.max() #max value

RandomArray.argmin() #index of min value in that array

RandomArray.argmax() #index of max value in that array

Indexing a 1-dimensional numpy array works similarly to slicing lists.

      1D-Array[0]      #the first element
      1D-Array[0:10]   #elements zero to nine (the stop index is not included)
      1D-Array[0:10:2] #every second element from zero to nine
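
As a small sketch (assuming numpy is imported as np), here is what those slices return on a concrete 1-D array:

      one_d = np.arange(10)   # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
      one_d[0]                # 0, the first element
      one_d[0:5]              # array([0, 1, 2, 3, 4]), the stop index is excluded
      one_d[0:10:2]           # array([0, 2, 4, 6, 8]), every second element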

Technically, our RandomArray is a 2-D matrix. To access the first 10 columns we need to index like this:

SlicedArray = RandomArray[:,:10]
      The : before the comma tells numpy that we want all rows of the array.

      The :10 after the comma tells numpy that we want only the first 10 columns of the matrix.
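
We can double-check that the slice did what we expect by looking at the shape again (reusing the SlicedArray from above):

      SlicedArray.shape   # (1, 10): still 1 row, but only the first 10 columns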

And we can also use the np.arange function to create a numpy array with values in a given range

values = np.arange(0,10)
values, values.dtype
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), dtype('int32'))

SlicedArray (shape (1, 10)) and values (shape (10,)) contain the same number of elements, and numpy broadcasts these shapes against each other, so we can perform mathematical operations with them.

The operations are element-wise: the first element of array 1 is multiplied with the first element of array 2, and so on.

values * SlicedArray
array([[0.        , 0.27416264, 0.65348972, 2.38691933, 0.05090751,
        4.33516312, 3.4804216 , 6.213112  , 3.76518539, 7.90406815]])

We can also use logicals to compare and access numpy arrays

values > SlicedArray
array([[False,  True,  True,  True,  True,  True,  True,  True,  True,
         True]])

This returns an array of boolean values: for every element, the condition is either True or False. We can use this boolean output as a mask. Masks are basically an accelerated version of the "if the element meets the condition, append it to one list, else append it to another list" pattern we practiced before.

For comparison, the loop below performs the same check element by element in plain Python, which is much slower.

mask = values > SlicedArray
values.reshape(1,10)[mask]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
for idx, element in enumerate(values):
    if element > SlicedArray[0][idx]:
        print(element)
    else:
        continue
1
2
3
4
5
6
7
8
9

With the np.random.choice function, we can pass a numpy array and get back a random set of elements from that array. We can determine the number of draws with the size parameter.

But what’s really cool is what happens behind the scenes.

Even though you're writing Python code, a lot of NumPy's operations, like random.choice, are actually powered by code written in C. That's because Python is a high-level, interpreted language, which means it's very readable and flexible, but not the fastest when it comes to numerical operations or looping over large datasets.

On the other hand, C is a low-level, compiled language, which means it runs much faster. So to get the best of both worlds — Python’s simplicity and C’s speed — many libraries like NumPy are written in C or Cython (a Python-like language that compiles to C) under the hood, and then “wrapped” in Python. This way, you write code that looks and feels like Python, but it’s executed at C speed behind the scenes.
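
As a rough illustration of that speed difference, the sketch below times the same sum once with a plain Python loop and once with NumPy's compiled sum. The exact numbers depend on your machine, but the NumPy version should be clearly faster.

      import time
      import numpy as np

      data = np.random.rand(1_000_000)

      start = time.time()
      total = 0.0
      for value in data:        # plain Python: one element at a time
          total += value
      print("Python loop:", time.time() - start)

      start = time.time()
      total = data.sum()        # NumPy: the loop runs in compiled C code
      print("NumPy sum:  ", time.time() - start)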

menu = np.array(["Espresso", "Latte", "Cappuccino", "Americano", "Mocha"])
orders = np.random.choice(menu, size=50)
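
For example, we could count how often each drink was (randomly) ordered. This is a small sketch using np.unique; the exact counts will differ on every run.

      drinks, counts = np.unique(orders, return_counts=True)
      for drink, count in zip(drinks, counts):
          print(drink, count)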

Under the hood, np.random.choice is using a C-based loop. Even though vanilla Python is slower than numpy, we can still show that the logic of our for loops applies here!

Exercise#

Create a function called random_choice. The goal of the function is to return a similar output as np.random.choice. Since this function uses a loop under the hood, integrate a loop within your function as well. The output should be a numpy array. Try to use the same arguments as in np.random.choice (an array as its input and the size parameter, which defines the number of iterations).

def random_choice(array, size):
    # collect `size` randomly chosen elements from `array`
    choice = []
    for i in range(size):
        ran = np.random.randint(0, len(array))  # random index into the array
        choice.append(array[ran])

    return np.array(choice)
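
A quick usage sketch, reusing the menu array from above (the drinks you get back will differ on every run, since the indices are drawn at random):

      random_choice(menu, 5)           # e.g. ['Mocha' 'Latte' 'Latte' 'Espresso' 'Americano']
      np.random.choice(menu, size=5)   # the NumPy original, for comparison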

Exercise#

Using the np.arange function, create one numpy array with values from 1 to 512 in steps of 2 and call the array all_scores.

Create another numpy array with values from 1 to 257 in steps of 1 and call it high_scores.

Your task is to find out, which values from high_scores are actually in all_scores.

Create a mask of boolean values, which should be True if an element of high_scores is in all_scores, and False otherwise. Use the np.isin(firstarray,secondarray) function to create the mask. Which array should you pass as the first argument, and which as the second? Use np.isin? to find out!

Use this mask to create a new array called valid_high_scores. How do these arrays differ in their shape and distribution (mean + standard deviation)?

all_scores = np.arange(1,512,2)
high_scores = np.arange(1,257)

mask = np.isin(high_scores,all_scores)

valid_high_scores = high_scores[mask]
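
To answer the question from the exercise, we can compare the shapes and the summary statistics directly. A small sketch (both arrays are evenly spaced, so the numbers are easy to reason about):

      all_scores.shape, high_scores.shape, valid_high_scores.shape   # (256,), (256,), (128,): the mask halves high_scores
      high_scores.mean(), high_scores.std()               # mean 128.5, std roughly 73.9
      valid_high_scores.mean(), valid_high_scores.std()   # mean 128.0 and a similar spread, but only the odd scores remain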

Exercise#

Create a numpy array of 9 numbers using np.arange. Reshape it to a 2D matrix of the shape (3,3).

Hint: Use the array's .reshape() method.

np.arange(9).reshape(3,3)

Exercise#

   array([[ 1,  2,  3,  4],
          [ 5,  6,  7,  8],
          [ 9, 10, 11, 12]])

What element would you expect to see if we index into the array using (array[2,3]) ?

Answer: 12 (indexing starts at 0, so row index 2 is the 3rd row and column index 3 is the 4th column)

Exercise#

Create a numpy array with random integers between 1 and 255, with the shape 64,64.

Hint: Use the np.random.randint function

np.random.randint(1, 256, size=(64,64)) #the upper bound of randint is exclusive, so 256 is needed for 255 to appear


Pandas#

So pandas is the Python library you want to use for organizing, manipulating and analyzing your datasets. The standard data type used in pandas is the dataframe. Pandas dataframes are basically like an Excel sheet, but way, way better.

In this section, we will download a dataset and use this to explore dataframes and apply what we have learned so far.

But first the basic import:

      import pandas as pd

This is the way to go. Again, you can use whatever abbreviation you want to, but I don't think I ever saw code where someone just used pandas.xyz or pandas as p or something strange like that.

import pandas as pd

Before transitioning to the dataset, you should know a thing or two about pandas.

The cool thing here is that you can actually convert lists or numpy arrays to a dataframe.

What you usually want to do is put your lists or arrays into a dictionary. A dictionary is another built-in Python data type. Its syntax goes like this:

      dictionary = {"Participant Number":[0,1,2,3,4],
                    "Reaction times":[100,50,76,34,95]}      

The string input is what we call a key (Participant Number, Reaction times). The key gives you access to the values that are associated with it. We won't focus on dictionaries too much here, but you should know that a dictionary can store values (or lists of values, or arrays) under so-called keys.
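
A small sketch of how you get values back out of a dictionary (reusing the dictionary from above): you index with the key instead of a position.

      dictionary["Reaction times"]      # [100, 50, 76, 34, 95]
      dictionary["Reaction times"][0]   # 100, the first reaction time
      list(dictionary.keys())           # ['Participant Number', 'Reaction times']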

What we can now do is create a dataframe from this dictionary.

'''This code cell creates a dictionary called coffee_order_dictionary. It has four keys called Customer_ID, Drink, Size, Time_to_prepare_sec.
The Customer_ID values are created using the np.arange function, which gives values from start up to (but not including) stop. The Drink, Size and Time_to_prepare_sec values are drawn at random using np.random.choice and np.random.randint.

'''
coffee_order_dictionary = {
    "Customer_ID": np.arange(1, 51),
    "Drink": np.random.choice(
        ["Latte", "Espresso", "Cappuccino", "Americano", "Mocha"], size=50),
    "Size": np.random.choice(["Small", "Medium", "Large"], size=50), #np.random.choice picks one element at a time at random
    "Time_to_prepare_sec": np.random.randint(60, 300, size=50)
}
dataframe = pd.DataFrame(coffee_order_dictionary)

Our dataframe should now have one row per customer and one column per key of the dictionary. Let's take a look!


We can use the dataframe.head(n=n) method to display the first n entries of our dataframe.

dataframe.head(n=10)
   Customer_ID       Drink    Size  Time_to_prepare_sec
0            1    Espresso   Small                   92
1            2   Americano   Small                   89
2            3  Cappuccino   Large                  198
3            4  Cappuccino   Large                  223
4            5  Cappuccino   Large                  286
5            6  Cappuccino   Large                  102
6            7       Latte   Small                  100
7            8       Latte  Medium                  271
8            9       Mocha   Small                  282
9           10       Latte   Small                  257

You can again use the ? here to gather more information about your dataframe.

dataframe?
Type:        DataFrame
String form:
Customer_ID       Drink    Size  Time_to_prepare_sec
           0            1    Espresso   Small        <...>        Mocha   Small                  282
           9           10       Latte   Small                  257
Length:      10
File:        c:\users\janos\anaconda3\lib\site-packages\pandas\core\frame.py
Docstring:  
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order. If a dict contains Series
    which have an index defined, it is aligned by its index.

    .. versionchanged:: 0.25.0
       If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame when data does not have them,
    defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    will perform column selection instead.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs.
    For dict data, the default of None behaves like ``copy=True``.  For DataFrame
    or 2d ndarray input, the default of None behaves like ``copy=False``.

    .. versionchanged:: 1.3.0

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from a dictionary including Series:

>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3

You can see that this object is neatly organized. It has four columns, which correspond to the keys from our coffee_order_dictionary. With this dataframe, we now have the opportunity to do many different things. But this would be pretty boring based on this dataframe. So we will load a different one in and check out pandas functionalities based on it.

The code we are using to obtain this data is not Python but a bash (shell) command, marked by the leading !.

!curl -O https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Stadt_Koeln_Statistischer_Datenkatalog.csv

We have now temporarily downloaded the Stadt_Koeln_Statistischer_Datenkatalog.csv file into our Google Colab session. This also means that we can now load it into a pandas dataframe. The function we want to use for that is called

      pd.read_csv(yourfilename)

We use this function to read the Stadt_Koeln_Statistischer_Datenkatalog.csv file and store it in a dataframe called koeln_stats.

Sometimes, when reading in a .csv file, we need to pass the sep argument. This prevents all columns from ending up in a single one, which would render the dataframe useless. The sep argument tells pandas which separator to use when splitting the columns.

koeln_stats = pd.read_csv(
    "https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Stadt_Koeln_Statistischer_Datenkatalog.csv",
    sep=";")
C:\Users\janos\AppData\Local\Temp\ipykernel_5584\989657012.py:1: DtypeWarning: Columns (9,11,13,15,17,19,23,24,26,28,30,31,33,34,38,42,43,45,47,49,50,52,53,56,62,64,66,67,69,72,77,78,79,82,87,89,91,98,101,102,105,106,107,108,110,113,115,118,119,123,124,125,126,127,128,129,136,137,138,139,140,141,143,145,146,158,159,160,161,167,168,169,170,171,173,174) have mixed types. Specify dtype option on import or set low_memory=False.
  koeln_stats = pd.read_csv(

Because the column names of the koeln_stats dataframe are not informative at all, we also need to read a description file into a pandas dataframe called koeln_stats_description. This will be useful later on.

koeln_stats_description = pd.read_csv(
    "https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Beschreibung_Statistischer_Datenkatalog.csv",
    sep=";")

We can display our newly obtained dataframe by simply typing and running koeln_stats in a code cell. If you are only interested in viewing the first or last n elements of your dataframe, you can use df.head(n=n) or df.tail(n=n), respectively.

koeln_stats.head(n=10)
S_JAHR S_RAUM RAUM S_RAUMEBENE RAUMEBENE A0002A A0002P A0022S A0025A A0027A ... H0051S H0052S B0003A B0004A B0009A B0022S B0023S B0025A B0026P B0026A
0 2012 0 0 / Stadt Köln 0 Gesamtstadt 180415.0 17,271948 41,90013762 1044555.0 46426 ... 87,059374 483,255 5944 2941 3114 39,40239241 75,57050842 544630.0 7,522905 40972
1 2012 1 1 / Innenstadt 1 Stadtbezirke 21712.0 16,985457 40,86903262 127827.0 4428 ... 93,269732 458,447 566 296 193 40,39537813 63,87377692 80841.0 2,508628 2028
2 2012 2 2 / Rodenkirchen 1 Stadtbezirke 14788.0 14,337793 43,45253054 103140.0 5331 ... 91,860767 569,468 1187 450 348 44,27280396 85,89885062 53159.0 3,397355 1806
3 2012 3 3 / Lindenthal 1 Stadtbezirke 14132.0 9,872231 42,06031943 143149.0 6787 ... 91,920257 525,04 1172 689 848 45,80771085 82,08048668 79889.0 1,126563 900
4 2012 4 4 / Ehrenfeld 1 Stadtbezirke 19811.0 18,779445 40,54831047 105493.0 3935 ... 83,775698 449,053 439 365 293 36,15031329 69,38747475 54961.0 12,57437 6911
5 2012 5 5 / Nippes 1 Stadtbezirke 20676.0 18,145597 42,27830898 113945.0 5154 ... 84,574078 491,482 214 52 114 37,56269253 71,26460647 60059.0 7,341114 4409
6 2012 6 6 / Chorweiler 1 Stadtbezirke 14892.0 18,409049 42,06668212 80895.0 3493 ... 84,421093 499,165 319 113 189 37,27096854 87,33156645 34524.0 23,473525 8104
7 2012 7 7 / Porz 1 Stadtbezirke 16613.0 15,235833 43,36416022 109039.0 5084 ... 84,779095 553,792 453 147 316 39,70796687 82,05816466 52764.0 5,069744 2675
8 2012 8 8 / Kalk 1 Stadtbezirke 29259.0 25,468077 40,73158448 114885.0 5163 ... 82,43595 377,211 915 441 417 35,27169778 73,36004852 55237.0 12,618353 6970
9 2012 9 9 / Mülheim 1 Stadtbezirke 28532.0 19,518135 41,96456415 146182.0 7051 ... 81,935198 420,681 679 388 396 36,80407985 73,50256845 73196.0 9,794251 7169

10 rows × 175 columns

koeln_stats.tail(n=10)
S_JAHR S_RAUM RAUM S_RAUMEBENE RAUMEBENE A0002A A0002P A0022S A0025A A0027A ... H0051S H0052S B0003A B0004A B0009A B0022S B0023S B0025A B0026P B0026A
8198 2023 907030001 907030001 / Siedlung Klosterhof 3 Statistische Quartiere 557.0 21,514098 41,49510751 2589.0 158 ... 91,377 442,255 NaN NaN NaN NaN NaN NaN NaN 50
8199 2023 907060001 907060001 / Siedlung Am Donewald 3 Statistische Quartiere 568.0 27,559437 38,91557496 2061.0 90 ... 94,60204 361,475 NaN NaN NaN NaN NaN NaN NaN 571
8200 2023 908010001 908010001 / Stammheim-Nord 3 Statistische Quartiere 337.0 21,997389 40,79868364 1532.0 89 ... 101,85955 1124,02 NaN NaN NaN NaN NaN NaN NaN 0
8201 2023 908020001 908020001 / Stammheim-Süd - Adolf-Kober-Str. 3 Statistische Quartiere 609.0 24,477492 43,19138532 2488.0 220 ... 95,658613 430,466 NaN NaN NaN NaN NaN NaN NaN 41
8202 2023 908020002 908020002 / Stammheim-Süd - Ricarda-Huch-Str. 3 Statistische Quartiere 399.0 26,181102 43,38019466 1524.0 116 ... 88,834381 348,425 NaN NaN NaN NaN NaN NaN NaN 199
8203 2023 908030001 908030001 / Stammheim 3 Statistische Quartiere 367.0 14,864318 46,16612664 2469.0 196 ... 92,187551 574,726 NaN NaN NaN NaN NaN NaN NaN 14
8204 2023 909010001 909010001 / Flittard 3 Statistische Quartiere 309.0 12,112897 45,61613093 2551.0 186 ... 93,352664 589,964 NaN NaN NaN NaN NaN NaN NaN 14
8205 2023 909030001 909030001 / Bayer-Siedlung - Rungestr. 3 Statistische Quartiere 285.0 21,348315 43,81292135 1335.0 86 ... 93,488372 496,629 NaN NaN NaN NaN NaN NaN NaN 0
8206 2023 909030002 909030002 / Bayer-Siedlung - Roggendorfstr. 3 Statistische Quartiere 506.0 21,754084 44,40745916 2326.0 267 ... 96,421722 490,541 NaN NaN NaN NaN NaN NaN NaN 12
8207 2023 909030003 909030003 / Bayer-Siedlung - Hufelandstr. 3 Statistische Quartiere 384.0 20,210526 40,03061404 1900.0 89 ... 97,498829 512,631 NaN NaN NaN NaN NaN NaN NaN 0

10 rows × 175 columns

If we take a closer look at a column like A0002P, we actually see that the values look like this:

      17,271948

This is very problematic, because Python expects a . as the decimal separator. A value containing a comma will be interpreted as a string. Problems like this can always happen, especially when you deal with uncleaned data.

That's why we first need to define a function that replaces the , with a . and converts the result to a float. All values that cannot be converted become NaN (Not a Number).

def to_german_float(val):
    # convert a German-style decimal string like "17,271948" to a float
    try:
        return float(str(val).replace(",", "."))
    except ValueError:
        return np.nan  # anything that cannot be parsed as a number becomes NaN

# Apply the conversion to all object-type (text) columns,
# skipping the first two (RAUM and RAUMEBENE), which are genuine text
for col in koeln_stats.select_dtypes(include='object').columns[2:]:
    koeln_stats[col] = koeln_stats[col].apply(to_german_float)

koeln_stats
S_JAHR S_RAUM RAUM S_RAUMEBENE RAUMEBENE A0002A A0002P A0022S A0025A A0027A ... H0051S H0052S B0003A B0004A B0009A B0022S B0023S B0025A B0026P B0026A
0 2012 0 0 / Stadt Köln 0 Gesamtstadt 180415.0 17.271948 41.900138 1044555.0 46426.0 ... 87.059374 483.255 5944.0 2941.0 3114.0 39.402392 75.570508 544630.0 7.522905 40972.0
1 2012 1 1 / Innenstadt 1 Stadtbezirke 21712.0 16.985457 40.869033 127827.0 4428.0 ... 93.269732 458.447 566.0 296.0 193.0 40.395378 63.873777 80841.0 2.508628 2028.0
2 2012 2 2 / Rodenkirchen 1 Stadtbezirke 14788.0 14.337793 43.452531 103140.0 5331.0 ... 91.860767 569.468 1187.0 450.0 348.0 44.272804 85.898851 53159.0 3.397355 1806.0
3 2012 3 3 / Lindenthal 1 Stadtbezirke 14132.0 9.872231 42.060319 143149.0 6787.0 ... 91.920257 525.040 1172.0 689.0 848.0 45.807711 82.080487 79889.0 1.126563 900.0
4 2012 4 4 / Ehrenfeld 1 Stadtbezirke 19811.0 18.779445 40.548310 105493.0 3935.0 ... 83.775698 449.053 439.0 365.0 293.0 36.150313 69.387475 54961.0 12.574370 6911.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8203 2023 908030001 908030001 / Stammheim 3 Statistische Quartiere 367.0 14.864318 46.166127 2469.0 196.0 ... 92.187551 574.726 NaN NaN NaN NaN NaN NaN NaN 14.0
8204 2023 909010001 909010001 / Flittard 3 Statistische Quartiere 309.0 12.112897 45.616131 2551.0 186.0 ... 93.352664 589.964 NaN NaN NaN NaN NaN NaN NaN 14.0
8205 2023 909030001 909030001 / Bayer-Siedlung - Rungestr. 3 Statistische Quartiere 285.0 21.348315 43.812921 1335.0 86.0 ... 93.488372 496.629 NaN NaN NaN NaN NaN NaN NaN 0.0
8206 2023 909030002 909030002 / Bayer-Siedlung - Roggendorfstr. 3 Statistische Quartiere 506.0 21.754084 44.407459 2326.0 267.0 ... 96.421722 490.541 NaN NaN NaN NaN NaN NaN NaN 12.0
8207 2023 909030003 909030003 / Bayer-Siedlung - Hufelandstr. 3 Statistische Quartiere 384.0 20.210526 40.030614 1900.0 89.0 ... 97.498829 512.631 NaN NaN NaN NaN NaN NaN NaN 0.0

8208 rows × 175 columns

One last thing we need to do:

Some numbers in that dataframe are actually quite large. Pandas tends to use scientific notation to shorten the output, but this makes it hard to interpret at times. So let's change that!

pd.set_option('display.float_format', '{:,.2f}'.format)

This won't influence the to_german_float function we used, as the set_option call only influences how the values are printed, not how they are computed.

As you can see here, a dataframe looks strikingly similar to what you would expect in an Excel sheet. It has a bunch of rows and columns, and each column stores some information.

We can further examine the type and shape (rows and columns) of our dataframe by using the df.info() method.

koeln_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Columns: 175 entries, S_JAHR to B0026A
dtypes: float64(166), int64(7), object(2)
memory usage: 11.0+ MB

This method gives us information about the number of columns, the datatype (Dtype) of each column and how many entries (rows) there are. This is super helpful to get an idea of what kind of data we are dealing with.

However, the returned datatypes (dtypes: float64(166), int64(7), object(2)) refer to the values within each column. Object = mixed information (strings and integers, for example).
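
If you want to see which columns are still stored as object after our conversion, you can ask the dataframe directly. A small sketch; these should be the two text columns we deliberately skipped earlier.

      koeln_stats.select_dtypes(include='object').columns
      # Index(['RAUM', 'RAUMEBENE'], dtype='object')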

The datatype of a column itself is a pandas.Series.


print(f"the datatype of the column RAUM is {type(koeln_stats.RAUM)}")

To access a column of the dataframe, we have two options.

The first one is df.column (replace df with your own dataframe!!). However, this only works if your column name has no spaces or extra characters!

So we could use koeln_stats.RAUM but we can also use koeln_stats["RAUM"]. The output of these operations is equivalent!

koeln_stats.RAUM.head(n=5)
koeln_stats["RAUM"].head(n=5)

We can separately extract the column names with df.columns.

koeln_stats.columns
Index(['S_JAHR', 'S_RAUM', 'RAUM', 'S_RAUMEBENE', 'RAUMEBENE', 'A0002A',
       'A0002P', 'A0022S', 'A0025A', 'A0027A',
       ...
       'H0051S', 'H0052S', 'B0003A', 'B0004A', 'B0009A', 'B0022S', 'B0023S',
       'B0025A', 'B0026P', 'B0026A'],
      dtype='object', length=175)

We can also directly convert them to a list by calling the to_list() method!

columns_list = koeln_stats.columns.to_list()
type(columns_list)
list

And in principle, we can now use this list to get a subset of our dataframe, extracting only the first two columns.

koeln_stats[columns_list[:2]]
S_JAHR S_RAUM
0 2012 0
1 2012 1
2 2012 2
3 2012 3
4 2012 4
... ... ...
8203 2023 908030001
8204 2023 909010001
8205 2023 909030001
8206 2023 909030002
8207 2023 909030003

8208 rows × 2 columns

df.describe() can be used to get a numerical overview of all values in our dataframe

koeln_stats.describe()
S_JAHR S_RAUM S_RAUMEBENE A0002A A0002P A0022S A0025A A0027A A0027P A0029A ... H0051S H0052S B0003A B0004A B0009A B0022S B0023S B0025A B0026P B0026A
count 8208.000000 8.208000e+03 8208.000000 8196.000000 8196.000000 8196.000000 8.196000e+03 8171.000000 8168.000000 8193.000000 ... 8208.000000 8208.000000 1291.000000 1207.000000 1139.000000 1152.000000 1152.000000 1152.000000 1152.000000 8142.000000
mean 2017.500000 4.189774e+08 2.869883 1628.660200 18.953883 42.038425 8.318157e+03 428.981642 5.217621 246.199194 ... 92.323871 507.989197 320.107668 147.967688 137.525900 40.370575 82.584000 17469.364583 7.885326 316.908745
std 3.452263 3.057064e+08 0.448253 10005.266938 10.583494 3.765539 5.364302e+04 2808.928659 2.898831 1571.445516 ... 9.874286 274.522550 1014.343728 461.114190 426.184214 7.641356 17.648600 58296.342893 10.581390 1923.361828
min 2012.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 ... 61.800000 0.000000 -52.000000 -53.000000 -7.000000 24.738025 59.646470 477.000000 0.000000 0.000000
25% 2014.750000 1.050100e+08 3.000000 213.750000 11.275600 39.719668 1.557000e+03 62.500000 3.409433 42.000000 ... 85.754317 363.451500 24.000000 9.000000 7.000000 36.579911 71.708031 3192.750000 1.761331 0.000000
50% 2017.500000 4.010400e+08 3.000000 343.000000 16.527840 42.123683 1.985500e+03 99.000000 4.817014 57.000000 ... 90.675546 458.891500 71.000000 26.000000 25.000000 39.630610 81.366647 5510.000000 4.822360 30.000000
75% 2020.250000 7.062975e+08 3.000000 636.000000 23.971053 44.426587 2.547000e+03 164.000000 6.417112 82.000000 ... 96.492498 578.956250 217.000000 98.000000 94.000000 44.327964 90.098877 10561.500000 9.593492 137.000000
max 2023.000000 9.090300e+08 4.000000 228555.000000 82.587783 71.926288 1.095520e+06 64063.000000 48.620911 34061.000000 ... 149.796886 4307.404000 9912.000000 4689.000000 3957.000000 92.464392 205.228320 572090.000000 89.996014 40972.000000

8 rows × 173 columns

Indexing and masking in pandas dataframes works a bit like in a numpy array. We can also extract multiple columns at once

koeln_stats[["S_RAUM","S_JAHR","S_RAUMEBENE"]]

This is equivalent to the snippet below, since it's just lists after all!

column_lists = ["S_RAUM","S_JAHR","S_RAUMEBENE"]

koeln_stats[column_lists]

We can also use df.loc[rows,columns] to index into our dataframe. .loc indexes into the dataframe based on labels. Since the row labels are by default just numbers, we can pass an integer here. To select a column, we pass the column name as a string.

koeln_stats.loc[0,"A0002P"]
17.271948

Exercise#

Show every second row of the first 16 rows of the koeln_stats dataframe for the column S_RAUM.

Hint: Use .loc indexing. You can pass a slice to the row selection and “S_RAUM” to the column selection: df.loc[rows,column]. Use slicing (see the reminder below) to get every 2nd row. Note that, unlike plain Python slicing, the stop label in .loc is included.

The general syntax for slicing is

      [start:stop:step] -> [index where the slicing starts : index where the slicing stops : Interval between slices]
koeln_stats.loc[:16:2,"S_RAUM"]

To use integers for indexing, we need to use df.iloc. Now, we can simply pass integer values for both rows and columns.

      df.iloc[0,1] -> First row and second column of the dataframe
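
For instance, based on the head() output above, the sketch below picks single values and small blocks purely by position:

      koeln_stats.iloc[0, 5]       # first row, sixth column (A0002A), 180415.0 in the output above
      koeln_stats.iloc[0:3, 0:5]   # first three rows of the first five columns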

Exercise#

Define a random integer called ran_col using the numpy.random.randint function. It should not be larger than the number of columns in koeln_stats.

Define a second random integer called ran_rows using the same function. Make sure it's not larger than the number of rows in koeln_stats.

Use these two integers to index into the dataframe using .iloc

# Define random column and row indices
ran_col = np.random.randint(0, koeln_stats.shape[1])  # Random column index (0 to number of columns - 1)
ran_row = np.random.randint(0, koeln_stats.shape[0])  # Random row index (0 to number of rows - 1)

# Use .iloc to index into the dataframe
random_value = koeln_stats.iloc[ran_row, ran_col]

We can also create a mask that is based on boolean values and use it to extract parts of the dataframe, where a given condition is True.

The column S_JAHR stores the year in which the corresponding statistic was collected.

mask = koeln_stats["S_JAHR"] == 2012

koeln_stats[mask]

This is equivalent to

koeln_stats[koeln_stats.S_JAHR == 2012]
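
Masks can also be combined with & ("and") and | ("or"); note the parentheses around each condition. A small sketch, using the RAUMEBENE coding we saw in the head() output (level 1 = Stadtbezirke):

      mask_2012_bezirke = (koeln_stats["S_JAHR"] == 2012) & (koeln_stats["S_RAUMEBENE"] == 1)
      koeln_stats[mask_2012_bezirke]   # only the 2012 rows at the Stadtbezirke level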

Exercise#

Create a new variable called my_vedel. Assign to it a string of the vedel you are living in (if you are comfortable with that, otherwise just use any other.)

We now want to extract the data, that belongs to my_vedel.

Since this information is stored in the RAUM column, we need to build the mask based on that column.

Unfortunately, we cannot use simple logical operator indexing here. So we need to use df.column.str.contains(str).

Create a variable called mask. Use the syntax (df.column.str.contains(str)) from above.

Hint: Replace df with the actual name of the dataframe, column with the RAUM column, and (str) with the name of your vedel (as a string variable!)

Create a new variable called the name of your vedel. This variable should have only the values of your vedel!

For example, you want something like

      mask = ....

      ehrenfeld = koeln_stats[mask]
my_vedel = "Ehrenfeld"
koeln_stats[koeln_stats.RAUM.str.contains(my_vedel)]

Exercise#

Using the koeln_stats_description dataframe, look up a column code that you think is interesting. Access that column in your vedel dataframe or in the koeln_stats dataframe.

Hint: Use koeln_stats_description.head(n=50) to see more column names stored in there.

Hint 2.0: Use this code

      koeln_stats_description[["SCHLUESSEL","INHALT"]].head(n=50).iloc[num_row,1]

to get specific information about a given row!

filtered = koeln_stats[["S_JAHR","S_RAUM","RAUM","S_RAUMEBENE","RAUMEBENE","A0275A","A0315A"]]
filtered.groupby(["RAUM","S_JAHR"])["A0275A"].mean()

With that column in mind you can do some cool investigation. Maybe you are interested in whether the number of people above 80 years differs between vedels?

To get an idea of the mean distribution you can use the following syntax

      df.groupby("Column you believe is interesting to sort by")."OutComeColumn".mean()

Try this now with your dataframe.

Hint: If you want to group by multiple columns, you must pass these columns in a list to groupby.

      df.groupby(["Column1","Column2"])["Outcome"].mean()
filtered.groupby(["RAUM","S_JAHR"])["A0275A"].mean()

We can also infer distribution stats (Mean, Standard Deviation) directly from the column

koeln_stats["A0002P"].mean(),koeln_stats["A0002P"].std()
(18.953883179111735, 10.583493602915325)

Data Visualization and Plotting#

Now that we’ve taken a closer look at our data through basic descriptive statistics and data types, we’ll take the next step by exploring it visually. Basic data visualization offers a different perspective and can reveal key patterns or issues relevant for further analysis. In the following steps, we’ll use various Python libraries and functions to create visualizations that highlight different aspects of the data.

What is Plotting?

Plotting in data science and programming refers to the visual representation of data using charts or graphs. It helps us understand patterns, relationships, and trends in data more clearly and efficiently than raw numbers alone. By turning data into visual formats, such as line graphs, bar charts, histograms, or scatter plots, we can make more informed decisions, identify outliers, and communicate insights to others.

There are two very popular libraries in python, which are almost always used for visualizing data.

That is seaborn and matplotlib.

Matplotlib's plotting interface was originally modeled after MATLAB, and it is the most widely used plotting library in Python.


Seaborn is built on top of matplotlib and is probably the second most popular library for visualizing your data in Python.


Once more, these libraries have their own commonly used abbreviations. Oftentimes you want to import these libraries like this

import matplotlib.pyplot as plt
import seaborn as sns
sns.displot(data=dataframe, x="Time_to_prepare_sec",hue="Size",kind="kde",multiple="stack")

We can also combine different plots into one figure

fig,axes=plt.subplots(1,3,sharey=True,sharex=True)
sns.lineplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",errorbar=None,ax=axes[0],palette="deep")
sns.barplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",errorbar=None,ax=axes[1],palette="deep")
sns.violinplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",ax=axes[2])
for ax in axes:
    ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()

We can also create an interaction plot. This plot might be useful if we assume that our independent variables interact. For example, does the size of the drink influence the time to prepare it? We indicate this with the hue parameter.

sns.lineplot(data=dataframe, y="Time_to_prepare_sec", x="Drink",hue="Size",errorbar=None,palette="deep")
plt.title("Time to Prepare Different Drinks by Size", fontsize=14)
plt.xlabel("Drink Type", fontsize=12)
plt.ylabel("Preparation Time (sec)", fontsize=12)
plt.xticks(rotation=45)  # Rotate x-axis labels
plt.grid(True, linestyle="--", alpha=0.5)
plt.tight_layout()
plt.legend(title="Drink Size")

There are, obviously, many many more plotting styles. You can find all of them in the seaborn example gallery.

Anyway, after this short tutorial on plotting and data visualization, let's return to our koeln dataframe.

Exercise#

Visualize the column you picked from earlier, using the seaborn library. Add xlabels, ylabels and a title to your plot. If you want to, you can choose multiple columns and plot something interactive, using the hue parameter.

For example, you might wonder whether the number of people > 80 years old increased over the years in a specific vedel.

sns.lineplot(data=filtered,x="S_JAHR",y="A0275A")

plt.title("Anzahl drei Kinder Haushalte in Köln über die Jahre hinweg", fontsize=14)
plt.xlabel("Jahr", fontsize=12)
plt.ylabel("Anzahl an Haushalten", fontsize=12)
plt.xticks(rotation=45)  # Rotate x-axis labels
plt.grid(True, linestyle="--", alpha=0.5)
plt.tight_layout()