Common Data Structures in Python#
There are many different data structures used in Python. The most prominent ones are numpy arrays, pandas dataframes and dictionaries.
In this notebook, we will talk about these and how they are used within the Python language.
Numpy Arrays#
An array is a data structure we can use to store numerical information. Arrays can be n-dimensional, but typically they are 1- or 2-dimensional. 2-dimensional arrays are very often used to represent images (a grayscale image is just a grid of values between 0 (black) and 255 (white)). The typical structure of a 2-D array is something we call rows and columns.
This is a 1-dimensional array:
np.array([0,1,2,3]) (shape: (4,), i.e. 4 elements; data from 1 dog that evaluated her favorite snack)
This is a 2-dimensional array:
np.array([[0,1,2,3], (shape: (2 rows, 4 columns); data from 2 dogs that evaluated their favorite snacks)
[4,5,6,7]])
What you can see in this example is actually nothing more than calling the np.array() function and passing a list [0,1,2,3] to it.
Within a numpy array, the stored values must be homogeneous, meaning that they all need to belong to the same data type. Numpy arrays are optimized for numerical computations.
Numpy arrays also come with a fixed size!
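To see the homogeneity rule in action, here is a small sketch: when you pass mixed inputs, NumPy silently upcasts everything to one common dtype.

```python
import numpy as np

# Mixing integers and floats: everything is upcast to one common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64 -- the integers were promoted to floats

# Mixing numbers and strings: everything becomes a string
strings = np.array([1, "two", 3])
print(strings.dtype.kind)  # 'U' -- unicode strings
```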
We will now start to explore the numpy environment and the numpy arrays.
So first, start by importing numpy as np
import numpy as np
We will not work with real data yet, but rather simulate our own numpy arrays to work with. Numpy offers some really useful functions we can use to generate our arrays.
Let's start with the numpy.random.rand
function.
The key arguments we need to pass to the function are the sizes of the dimensions of the array we want to create. Let's start by creating a 1-dimensional array first.
The first argument of the numpy.random.rand
function determines the number of rows we want our array to have, whereas the second argument determines the number of columns.
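A quick sketch of how the dimension arguments translate into the resulting shape:

```python
import numpy as np

# Each positional argument is the size of one dimension
a = np.random.rand(2, 4)  # 2 rows, 4 columns
print(a.shape)            # (2, 4)

b = np.random.rand(5)     # a true 1-D array: one axis of length 5
print(b.shape)            # (5,)
```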
Exercise 15.0#
Create a 1-D numpy array using the rand
function from the random
module (from the numpy
package). Use it to create a numpy array with 1 Row and 20 Columns.
Assign your array to a variable called “RandomArray”.
RandomArray = np.random.rand(1,20)
The information about the shape
(e.g., how many rows and columns we have) is actually stored within the array object itself. We can access it with array.shape
RandomArray.shape
(1, 20)
We can also create arrays with zeros or ones
np.zeros((1,5))
np.ones((1,5))
Now in numpy, we can use multiple methods to easily extract information
RandomArray.min() #min value
RandomArray.max() #max value
RandomArray.argmin() #index of min value in that array
RandomArray.argmax() #index of max value in that array
Indexing a 1-dimensional numpy array works similar to slicing
in lists.
1D-Array[0] #element zero
1D-Array[0:10] #elements zero to 10 (10 excluded)
1D-Array[0:10:2] #every second element from zero to 10
Technically, our RandomArray is a 2-D matrix. To access the first 10 columns we need to index like this
SlicedArray = RandomArray[:,:10]
[: tells us that we want to index into all rows of the array
,:10] tells us that we want to index into the first 10 columns of the matrix
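A small self-contained example of this 2-D slicing, using np.arange instead of random data so the sliced values are predictable:

```python
import numpy as np

# Predictable values instead of random ones, so the slices are easy to check
grid = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

print(grid[:, :2])   # all rows, first two columns
print(grid[0, :])    # first row, all columns
print(grid[1:, 2:])  # rows 1 onwards, columns 2 onwards
```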
And we can also use the np.arange
function to create a numpy array with values in a given range
values = np.arange(0,10)
values, values.dtype
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), dtype('int32'))
Since SlicedArray and values have the same shape, we can perform any mathematical operation with them.
The operations are element-wise: the first element of array 1 is multiplied with the first element of array 2, and so on.
values * SlicedArray
array([[0. , 0.27416264, 0.65348972, 2.38691933, 0.05090751,
4.33516312, 3.4804216 , 6.213112 , 3.76518539, 7.90406815]])
We can also use logicals to compare and access numpy arrays
values > SlicedArray
array([[False, True, True, True, True, True, True, True, True,
True]])
This returns an array of boolean
values, where the condition is either True
or False
.
We can use this output of boolean values as a mask
. Masks are basically an accelerated version of the "if element meets the condition, append it to another list" pattern we practised before. The loop version below performs the same comparison element by element in plain Python, which is much slower.
mask = values > SlicedArray
values.reshape(1,10)[mask]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
for idx, element in enumerate(values):
    if element > SlicedArray[0][idx]:
        print(element)
    else:
        continue
1
2
3
4
5
6
7
8
9
With the np.random.choice
function, we can pass a numpy array and get back a random set of elements from that array. We can determine the number of draws with the size
parameter.
But what’s really cool is what happens behind the scenes.
Even though you’re writing Python code, a lot of NumPy’s operations, like random.choice, are actually powered by code written in C. That’s because Python is a high-level, interpreted language, which means it’s very readable and flexible, but not the fastest when it comes to numerical operations or looping over large datasets.
On the other hand, C is a low-level, compiled language, which means it runs much faster. So to get the best of both worlds — Python’s simplicity and C’s speed — many libraries like NumPy are written in C or Cython (a Python-like language that compiles to C) under the hood, and then “wrapped” in Python. This way, you write code that looks and feels like Python, but it’s executed at C speed behind the scenes.
menu = np.array(["Espresso", "Latte", "Cappuccino", "Americano", "Mocha"])
orders = np.random.choice(menu, size=50)
Under the hood, np.random.choice
is using a C-based loop. Even though vanilla Python is slower than numpy, we can still show that the logic of our for loops applies here!
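We can sketch this speed difference ourselves. The snippet below times the vectorized comparison against an equivalent plain-Python loop; the exact timings will vary from machine to machine, but the results are identical.

```python
import time
import numpy as np

data = np.random.rand(1_000_000)

# Vectorized comparison: executed by NumPy's compiled C loop
t0 = time.perf_counter()
vectorized = data > 0.5
t_numpy = time.perf_counter() - t0

# The same logic as an explicit Python loop
t0 = time.perf_counter()
looped = np.array([x > 0.5 for x in data])
t_loop = time.perf_counter() - t0

print(f"NumPy: {t_numpy:.4f}s, loop: {t_loop:.4f}s")
print((vectorized == looped).all())  # identical results, very different speed
```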
Exercise#
Create a function called random_choice. The goal of the function is to return a similar output as np.random.choice
. Since this function uses a loop under the hood, integrate a loop within your function as well. The output should be a numpy array. Try to use the same arguments as in np.random.choice
(an array as its input and the size parameter, which defines the number of iterations).
def random_choice(arr, num_iterations):
    choice = []
    for i in range(num_iterations):
        ran = np.random.randint(0, len(arr))  # random index into the array
        choice.append(arr[ran])
    return np.array(choice)
Exercise#
Using the np.arange
function, create one numpy array with values from 1 to 512 in steps of 2 and call the array all_scores.
Create another numpy array with values from 1 to 257 in steps of 1 and call it high_scores.
Your task is to find out, which values from high_scores
are actually in all_scores
.
Create a mask
of boolean values, which should be True if an element of high_scores is in all_scores, and False otherwise. Use the np.isin(firstarray,secondarray)
function to create the mask.
Which array should you pass as the first argument, and which as the second? Use np.isin?
to find out!
Use this mask to create a new array called valid_high_scores
. How do these arrays differ in their shape and distribution (mean + standard deviation)?
all_scores = np.arange(1,512,2)
high_scores = np.arange(1,257)
mask = np.isin(high_scores,all_scores)
valid_high_scores = high_scores[mask]
Exercise#
Create a numpy array of 9 numbers using np.arange
. Reshape it to a 2D matrix of the shape (3,3).
Hint: Use the .reshape()
method of the array.
np.arange(9).reshape(3,3)
Exercise#
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
What element would you expect to see if we index into the array using array[2,3]?
Answer: 12 (Row:3, Column:4)
Exercise#
Create a numpy array with random integers between 1 and 255, with the shape 64,64.
Hint: Use the np.random.randint
function
np.random.randint(1,255,size=(64,64))
Pandas#
So pandas is the Python library you want to use for organizing, manipulating and analyzing your datasets. The standard data structure used in pandas is the dataframe.
Pandas dataframes are basically like an excel sheet, but way, way better.
In this section, we will download a dataset and use this to explore dataframes and apply what we have learned so far.
But first the basic import:
import pandas as pd
This is the way to go. Again, you can use whatever abbreviation you want, but I don't think I have ever seen code where someone just used pandas.xyz
or pandas as p
or something strange like that.
import pandas as pd
Before transitioning to the dataset, you should know a thing or two about pandas
.
The cool thing here is that you can actually convert lists
or numpy arrays
to a dataframe
.
What you usually want to do is put your lists or arrays into a dictionary
. A dictionary is another vanilla Python datatype. Its syntax goes like this:
dictionary = {"Participant Number":[0,1,2,3,4],
"Reaction times":[100,50,76,34,95]}
The string input is what we call a key
(Participant Number, Reaction times). The key basically stores the values that are associated with it. We won't focus on dictionaries too much here, but you should know that a dictionary can store values (or lists of values, or arrays) under so-called keys.
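For example, the values stored under a key can be retrieved by indexing the dictionary with that key:

```python
# The dictionary from above: keys map to the values stored under them
dictionary = {"Participant Number": [0, 1, 2, 3, 4],
              "Reaction times": [100, 50, 76, 34, 95]}

print(dictionary["Reaction times"])     # the whole list stored under that key
print(dictionary["Reaction times"][0])  # a single element: 100
print(list(dictionary.keys()))          # all keys of the dictionary
```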
What we can now do is create a dataframe
from this dictionary.
'''This code cell creates a dictionary called coffee_order_dictionary. It has four keys: Customer_ID, Drink, Size and Time_to_prepare_sec.
The Customer_ID values are created using the np.arange function, which gives values from start (inclusive) to stop (exclusive). We randomly create the Drink and Size values using the np.random.choice function and the preparation times using np.random.randint.
'''
coffee_order_dictionary = {
"Customer_ID": np.arange(1, 51),
"Drink": np.random.choice(
["Latte", "Espresso", "Cappuccino", "Americano", "Mocha"], size=50),
"Size": np.random.choice(["Small", "Medium", "Large"], size=50), #np.random.choice picks one element at a time at random
"Time_to_prepare_sec": np.random.randint(60, 300, size=50)
}
dataframe = pd.DataFrame(coffee_order_dictionary)
So the expected shape of our dataframe should look like this!
We can use the dataframe.head(n=n)
method to display the first n-entries of our dataframe.
dataframe.head(n=10)
|   | Customer_ID | Drink      | Size   | Time_to_prepare_sec |
|---|-------------|------------|--------|---------------------|
| 0 | 1           | Espresso   | Small  | 92                  |
| 1 | 2           | Americano  | Small  | 89                  |
| 2 | 3           | Cappuccino | Large  | 198                 |
| 3 | 4           | Cappuccino | Large  | 223                 |
| 4 | 5           | Cappuccino | Large  | 286                 |
| 5 | 6           | Cappuccino | Large  | 102                 |
| 6 | 7           | Latte      | Small  | 100                 |
| 7 | 8           | Latte      | Medium | 271                 |
| 8 | 9           | Mocha      | Small  | 282                 |
| 9 | 10          | Latte      | Small  | 257                 |
You can again use the ?
here to gather more information about your dataframe.
dataframe?
Type: DataFrame
String form:
Customer_ID Drink Size Time_to_prepare_sec
0 1 Espresso Small <...> Mocha Small 282
9 10 Latte Small 257
Length: 10
File: c:\users\janos\anaconda3\lib\site-packages\pandas\core\frame.py
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If
data is a dict, column order follows insertion-order. If a dict contains Series
which have an index defined, it is aligned by its index.
.. versionchanged:: 0.25.0
If data is a list of dicts, column order follows insertion-order.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if
no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame when data does not have them,
defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
will perform column selection instead.
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
Copy data from inputs.
For dict data, the default of None behaves like ``copy=True``. For DataFrame
or 2d ndarray input, the default of None behaves like ``copy=False``.
.. versionchanged:: 1.3.0
See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.
Examples
--------
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1 int64
col2 int64
dtype: object
To enforce a single dtype:
>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1 int8
col2 int8
dtype: object
Constructing DataFrame from a dictionary including Series:
>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
col1 col2
0 0 NaN
1 1 NaN
2 2 2.0
3 3 3.0
Constructing DataFrame from numpy ndarray:
>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
... columns=['a', 'b', 'c'])
>>> df2
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Constructing DataFrame from a numpy ndarray that has labeled columns:
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
... dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
c a
0 3 1
1 6 4
2 9 7
Constructing DataFrame from dataclass:
>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
x y
0 0 0
1 0 3
2 2 3
You can see that this object is neatly organized. It has four columns, which correspond to the keys
from our coffee_order_dictionary
. With this dataframe, we now have the opportunity to do many different things. But that would be pretty boring based on this dataframe. So we will load a different one in and check out pandas functionalities based on it.
The code we are using to obtain this data is not Python
but Bash
. The -O flag tells curl to save the file under its remote name.
!curl -O https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Stadt_Koeln_Statistischer_Datenkatalog.csv
We have now temporarily downloaded the Stadt_Koeln_Statistischer_Datenkatalog.csv
file in our Google Colab session.
This also means that we can now load it into a pandas dataframe. The function we want to use for that is called
pd.read_csv(yourfilename)
We use this function to read the Stadt_Koeln_Statistischer_Datenkatalog.csv
file and store it in a dataframe called koeln_stats
.
Sometimes, when reading in a .csv file, we need to pass the sep
argument. This prevents all columns from ending up in a single one, which would render the dataframe useless. The sep
argument tells pandas which separator to split the columns on.
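A minimal sketch of what goes wrong without sep, using a tiny made-up semicolon-separated file read from a string:

```python
from io import StringIO
import pandas as pd

# A tiny made-up semicolon-separated file, read from a string
raw = "name;score\nAlice;10\nBob;7\n"

wrong = pd.read_csv(StringIO(raw))           # default sep="," -> everything in one column
right = pd.read_csv(StringIO(raw), sep=";")  # columns are split correctly

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 2)
```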
koeln_stats = pd.read_csv(
"https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Stadt_Koeln_Statistischer_Datenkatalog.csv",
sep=";")
C:\Users\janos\AppData\Local\Temp\ipykernel_5584\989657012.py:1: DtypeWarning: Columns (9,11,13,15,17,19,23,24,26,28,30,31,33,34,38,42,43,45,47,49,50,52,53,56,62,64,66,67,69,72,77,78,79,82,87,89,91,98,101,102,105,106,107,108,110,113,115,118,119,123,124,125,126,127,128,129,136,137,138,139,140,141,143,145,146,158,159,160,161,167,168,169,170,171,173,174) have mixed types. Specify dtype option on import or set low_memory=False.
koeln_stats = pd.read_csv(
Because the column descriptions of the koeln_stats
dataframe are not informative at all, we also need to download the description file and read it into a pandas dataframe called koeln_stats_description
. This will be useful later on.
koeln_stats_description = pd.read_csv(
"https://raw.githubusercontent.com/JNPauli/IntroductionToPython/refs/heads/main/content/datasets/Beschreibung_Statistischer_Datenkatalog.csv",
sep=";")
We can display our newly obtained dataframe by simply typing and running koeln_stats in a code cell. If you are only interested in viewing the first or last n elements of your dataframe, you can use df.head(n=n)
or df.tail(n=n)
, respectively.
koeln_stats.head(n=10)
S_JAHR | S_RAUM | RAUM | S_RAUMEBENE | RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 0 | 0 / Stadt Köln | 0 | Gesamtstadt | 180415.0 | 17,271948 | 41,90013762 | 1044555.0 | 46426 | ... | 87,059374 | 483,255 | 5944 | 2941 | 3114 | 39,40239241 | 75,57050842 | 544630.0 | 7,522905 | 40972 |
1 | 2012 | 1 | 1 / Innenstadt | 1 | Stadtbezirke | 21712.0 | 16,985457 | 40,86903262 | 127827.0 | 4428 | ... | 93,269732 | 458,447 | 566 | 296 | 193 | 40,39537813 | 63,87377692 | 80841.0 | 2,508628 | 2028 |
2 | 2012 | 2 | 2 / Rodenkirchen | 1 | Stadtbezirke | 14788.0 | 14,337793 | 43,45253054 | 103140.0 | 5331 | ... | 91,860767 | 569,468 | 1187 | 450 | 348 | 44,27280396 | 85,89885062 | 53159.0 | 3,397355 | 1806 |
3 | 2012 | 3 | 3 / Lindenthal | 1 | Stadtbezirke | 14132.0 | 9,872231 | 42,06031943 | 143149.0 | 6787 | ... | 91,920257 | 525,04 | 1172 | 689 | 848 | 45,80771085 | 82,08048668 | 79889.0 | 1,126563 | 900 |
4 | 2012 | 4 | 4 / Ehrenfeld | 1 | Stadtbezirke | 19811.0 | 18,779445 | 40,54831047 | 105493.0 | 3935 | ... | 83,775698 | 449,053 | 439 | 365 | 293 | 36,15031329 | 69,38747475 | 54961.0 | 12,57437 | 6911 |
5 | 2012 | 5 | 5 / Nippes | 1 | Stadtbezirke | 20676.0 | 18,145597 | 42,27830898 | 113945.0 | 5154 | ... | 84,574078 | 491,482 | 214 | 52 | 114 | 37,56269253 | 71,26460647 | 60059.0 | 7,341114 | 4409 |
6 | 2012 | 6 | 6 / Chorweiler | 1 | Stadtbezirke | 14892.0 | 18,409049 | 42,06668212 | 80895.0 | 3493 | ... | 84,421093 | 499,165 | 319 | 113 | 189 | 37,27096854 | 87,33156645 | 34524.0 | 23,473525 | 8104 |
7 | 2012 | 7 | 7 / Porz | 1 | Stadtbezirke | 16613.0 | 15,235833 | 43,36416022 | 109039.0 | 5084 | ... | 84,779095 | 553,792 | 453 | 147 | 316 | 39,70796687 | 82,05816466 | 52764.0 | 5,069744 | 2675 |
8 | 2012 | 8 | 8 / Kalk | 1 | Stadtbezirke | 29259.0 | 25,468077 | 40,73158448 | 114885.0 | 5163 | ... | 82,43595 | 377,211 | 915 | 441 | 417 | 35,27169778 | 73,36004852 | 55237.0 | 12,618353 | 6970 |
9 | 2012 | 9 | 9 / Mülheim | 1 | Stadtbezirke | 28532.0 | 19,518135 | 41,96456415 | 146182.0 | 7051 | ... | 81,935198 | 420,681 | 679 | 388 | 396 | 36,80407985 | 73,50256845 | 73196.0 | 9,794251 | 7169 |
10 rows × 175 columns
koeln_stats.tail(n=10)
S_JAHR | S_RAUM | RAUM | S_RAUMEBENE | RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8198 | 2023 | 907030001 | 907030001 / Siedlung Klosterhof | 3 | Statistische Quartiere | 557.0 | 21,514098 | 41,49510751 | 2589.0 | 158 | ... | 91,377 | 442,255 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 50 |
8199 | 2023 | 907060001 | 907060001 / Siedlung Am Donewald | 3 | Statistische Quartiere | 568.0 | 27,559437 | 38,91557496 | 2061.0 | 90 | ... | 94,60204 | 361,475 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 571 |
8200 | 2023 | 908010001 | 908010001 / Stammheim-Nord | 3 | Statistische Quartiere | 337.0 | 21,997389 | 40,79868364 | 1532.0 | 89 | ... | 101,85955 | 1124,02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
8201 | 2023 | 908020001 | 908020001 / Stammheim-Süd - Adolf-Kober-Str. | 3 | Statistische Quartiere | 609.0 | 24,477492 | 43,19138532 | 2488.0 | 220 | ... | 95,658613 | 430,466 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 41 |
8202 | 2023 | 908020002 | 908020002 / Stammheim-Süd - Ricarda-Huch-Str. | 3 | Statistische Quartiere | 399.0 | 26,181102 | 43,38019466 | 1524.0 | 116 | ... | 88,834381 | 348,425 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 199 |
8203 | 2023 | 908030001 | 908030001 / Stammheim | 3 | Statistische Quartiere | 367.0 | 14,864318 | 46,16612664 | 2469.0 | 196 | ... | 92,187551 | 574,726 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14 |
8204 | 2023 | 909010001 | 909010001 / Flittard | 3 | Statistische Quartiere | 309.0 | 12,112897 | 45,61613093 | 2551.0 | 186 | ... | 93,352664 | 589,964 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14 |
8205 | 2023 | 909030001 | 909030001 / Bayer-Siedlung - Rungestr. | 3 | Statistische Quartiere | 285.0 | 21,348315 | 43,81292135 | 1335.0 | 86 | ... | 93,488372 | 496,629 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
8206 | 2023 | 909030002 | 909030002 / Bayer-Siedlung - Roggendorfstr. | 3 | Statistische Quartiere | 506.0 | 21,754084 | 44,40745916 | 2326.0 | 267 | ... | 96,421722 | 490,541 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12 |
8207 | 2023 | 909030003 | 909030003 / Bayer-Siedlung - Hufelandstr. | 3 | Statistische Quartiere | 384.0 | 20,210526 | 40,03061404 | 1900.0 | 89 | ... | 97,498829 | 512,631 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
10 rows × 175 columns
If we take a closer look at a column like A0002P
, we actually see that the values look like this:
17,271948
This is very problematic, because Python expects a .
as the decimal separator. A value containing a comma will be interpreted as a string. Problems like this can always happen, especially when you deal with uncleaned data.
That's why we first need to define a function that replaces the ,
with a .
and converts the result to a float. All values that cannot be converted will become NaN (Not a Number)
.
def to_german_float(val):
    try:
        return float(str(val).replace(",", "."))
    except (ValueError, TypeError):
        return np.nan

# Apply to all object-type columns (skipping the first two text columns)
for col in koeln_stats.select_dtypes(include='object').columns[2:]:
    koeln_stats[col] = koeln_stats[col].apply(to_german_float)
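A quick sanity check of the conversion helper (restated here so the snippet runs on its own):

```python
import numpy as np

# Restated here so the snippet is self-contained
def to_german_float(val):
    try:
        return float(str(val).replace(",", "."))
    except (ValueError, TypeError):
        return np.nan

print(to_german_float("17,271948"))   # 17.271948
print(to_german_float("Innenstadt"))  # nan -- non-numeric text becomes NaN
```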
koeln_stats
S_JAHR | S_RAUM | RAUM | S_RAUMEBENE | RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 0 | 0 / Stadt Köln | 0 | Gesamtstadt | 180415.0 | 17.271948 | 41.900138 | 1044555.0 | 46426.0 | ... | 87.059374 | 483.255 | 5944.0 | 2941.0 | 3114.0 | 39.402392 | 75.570508 | 544630.0 | 7.522905 | 40972.0 |
1 | 2012 | 1 | 1 / Innenstadt | 1 | Stadtbezirke | 21712.0 | 16.985457 | 40.869033 | 127827.0 | 4428.0 | ... | 93.269732 | 458.447 | 566.0 | 296.0 | 193.0 | 40.395378 | 63.873777 | 80841.0 | 2.508628 | 2028.0 |
2 | 2012 | 2 | 2 / Rodenkirchen | 1 | Stadtbezirke | 14788.0 | 14.337793 | 43.452531 | 103140.0 | 5331.0 | ... | 91.860767 | 569.468 | 1187.0 | 450.0 | 348.0 | 44.272804 | 85.898851 | 53159.0 | 3.397355 | 1806.0 |
3 | 2012 | 3 | 3 / Lindenthal | 1 | Stadtbezirke | 14132.0 | 9.872231 | 42.060319 | 143149.0 | 6787.0 | ... | 91.920257 | 525.040 | 1172.0 | 689.0 | 848.0 | 45.807711 | 82.080487 | 79889.0 | 1.126563 | 900.0 |
4 | 2012 | 4 | 4 / Ehrenfeld | 1 | Stadtbezirke | 19811.0 | 18.779445 | 40.548310 | 105493.0 | 3935.0 | ... | 83.775698 | 449.053 | 439.0 | 365.0 | 293.0 | 36.150313 | 69.387475 | 54961.0 | 12.574370 | 6911.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8203 | 2023 | 908030001 | 908030001 / Stammheim | 3 | Statistische Quartiere | 367.0 | 14.864318 | 46.166127 | 2469.0 | 196.0 | ... | 92.187551 | 574.726 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14.0 |
8204 | 2023 | 909010001 | 909010001 / Flittard | 3 | Statistische Quartiere | 309.0 | 12.112897 | 45.616131 | 2551.0 | 186.0 | ... | 93.352664 | 589.964 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14.0 |
8205 | 2023 | 909030001 | 909030001 / Bayer-Siedlung - Rungestr. | 3 | Statistische Quartiere | 285.0 | 21.348315 | 43.812921 | 1335.0 | 86.0 | ... | 93.488372 | 496.629 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
8206 | 2023 | 909030002 | 909030002 / Bayer-Siedlung - Roggendorfstr. | 3 | Statistische Quartiere | 506.0 | 21.754084 | 44.407459 | 2326.0 | 267.0 | ... | 96.421722 | 490.541 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 |
8207 | 2023 | 909030003 | 909030003 / Bayer-Siedlung - Hufelandstr. | 3 | Statistische Quartiere | 384.0 | 20.210526 | 40.030614 | 1900.0 | 89.0 | ... | 97.498829 | 512.631 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
8208 rows × 175 columns
One last thing we need to do:
Some numbers in that dataframe are actually quite large. Pandas tends to use scientific notation to shorten the output, but this makes it hard to interpret at times. So let's change that!
pd.set_option('display.float_format', '{:,.2f}'.format)
This won't influence the to_german_float
function we used, as set_option
only changes how the values are printed, not how they are stored or computed.
As you can see here, a dataframe looks strikingly similar to what you would expect in an Excel sheet. It has a bunch of rows and columns, and each column stores some information.
We can further examine the type and shape (rows and columns) of our dataframe by using the df.info
method.
koeln_stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8208 entries, 0 to 8207
Columns: 175 entries, S_JAHR to B0026A
dtypes: float64(166), int64(7), object(2)
memory usage: 11.0+ MB
This method gives us information about the number of columns, the datatype (Dtype) of each column and how many entries (rows) there are. This is super helpful to get an idea of what kind of data we are dealing with.
However, the returned datatypes (dtypes: float64(166), int64(7), object(2)
) refer to the values
within each column. object
= mixed information (strings and integers, for example)
The datatype of a column itself is a pandas.Series
.
print(f"the datatype of the column RAUM is {type(koeln_stats.RAUM)}")
the datatype of the column RAUM is <class 'pandas.core.series.Series'>
To access a column of the dataframe, we have two options.
The first one is df.column
(replace df with the name of your own dataframe!). However, this only works if your column name has no spaces or special characters!
So we could use koeln_stats.RAUM
but we can also use koeln_stats["RAUM"]
. The output of these operations is equivalent!
koeln_stats.RAUM.head(n=5)
koeln_stats["RAUM"].head(n=5)
We can separately extract the column names with df.columns
.
koeln_stats.columns
Index(['S_JAHR', 'S_RAUM', 'RAUM', 'S_RAUMEBENE', 'RAUMEBENE', 'A0002A',
'A0002P', 'A0022S', 'A0025A', 'A0027A',
...
'H0051S', 'H0052S', 'B0003A', 'B0004A', 'B0009A', 'B0022S', 'B0023S',
'B0025A', 'B0026P', 'B0026A'],
dtype='object', length=175)
We can also directly convert them to a list
by calling the to_list()
method!
columns_list = koeln_stats.columns.to_list()
type(columns_list)
list
And in principle, we can now use this list to get a subset of our dataframe, extracting only the first two columns.
koeln_stats[columns_list[:2]]
S_JAHR | S_RAUM | |
---|---|---|
0 | 2012 | 0 |
1 | 2012 | 1 |
2 | 2012 | 2 |
3 | 2012 | 3 |
4 | 2012 | 4 |
... | ... | ... |
8203 | 2023 | 908030001 |
8204 | 2023 | 909010001 |
8205 | 2023 | 909030001 |
8206 | 2023 | 909030002 |
8207 | 2023 | 909030003 |
8208 rows × 2 columns
df.describe() can be used to get a numerical overview of all values in our dataframe
koeln_stats.describe()
S_JAHR | S_RAUM | S_RAUMEBENE | A0002A | A0002P | A0022S | A0025A | A0027A | A0027P | A0029A | ... | H0051S | H0052S | B0003A | B0004A | B0009A | B0022S | B0023S | B0025A | B0026P | B0026A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8208.000000 | 8.208000e+03 | 8208.000000 | 8196.000000 | 8196.000000 | 8196.000000 | 8.196000e+03 | 8171.000000 | 8168.000000 | 8193.000000 | ... | 8208.000000 | 8208.000000 | 1291.000000 | 1207.000000 | 1139.000000 | 1152.000000 | 1152.000000 | 1152.000000 | 1152.000000 | 8142.000000 |
mean | 2017.500000 | 4.189774e+08 | 2.869883 | 1628.660200 | 18.953883 | 42.038425 | 8.318157e+03 | 428.981642 | 5.217621 | 246.199194 | ... | 92.323871 | 507.989197 | 320.107668 | 147.967688 | 137.525900 | 40.370575 | 82.584000 | 17469.364583 | 7.885326 | 316.908745 |
std | 3.452263 | 3.057064e+08 | 0.448253 | 10005.266938 | 10.583494 | 3.765539 | 5.364302e+04 | 2808.928659 | 2.898831 | 1571.445516 | ... | 9.874286 | 274.522550 | 1014.343728 | 461.114190 | 426.184214 | 7.641356 | 17.648600 | 58296.342893 | 10.581390 | 1923.361828 |
min | 2012.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | ... | 61.800000 | 0.000000 | -52.000000 | -53.000000 | -7.000000 | 24.738025 | 59.646470 | 477.000000 | 0.000000 | 0.000000 |
25% | 2014.750000 | 1.050100e+08 | 3.000000 | 213.750000 | 11.275600 | 39.719668 | 1.557000e+03 | 62.500000 | 3.409433 | 42.000000 | ... | 85.754317 | 363.451500 | 24.000000 | 9.000000 | 7.000000 | 36.579911 | 71.708031 | 3192.750000 | 1.761331 | 0.000000 |
50% | 2017.500000 | 4.010400e+08 | 3.000000 | 343.000000 | 16.527840 | 42.123683 | 1.985500e+03 | 99.000000 | 4.817014 | 57.000000 | ... | 90.675546 | 458.891500 | 71.000000 | 26.000000 | 25.000000 | 39.630610 | 81.366647 | 5510.000000 | 4.822360 | 30.000000 |
75% | 2020.250000 | 7.062975e+08 | 3.000000 | 636.000000 | 23.971053 | 44.426587 | 2.547000e+03 | 164.000000 | 6.417112 | 82.000000 | ... | 96.492498 | 578.956250 | 217.000000 | 98.000000 | 94.000000 | 44.327964 | 90.098877 | 10561.500000 | 9.593492 | 137.000000 |
max | 2023.000000 | 9.090300e+08 | 4.000000 | 228555.000000 | 82.587783 | 71.926288 | 1.095520e+06 | 64063.000000 | 48.620911 | 34061.000000 | ... | 149.796886 | 4307.404000 | 9912.000000 | 4689.000000 | 3957.000000 | 92.464392 | 205.228320 | 572090.000000 | 89.996014 | 40972.000000 |
8 rows × 173 columns
Indexing and masking in pandas dataframes works a bit like in a numpy array. We can also extract multiple columns at once
koeln_stats[["S_RAUM","S_JAHR","S_RAUMEBENE"]]
This is equivalent to the snippet below, since it's just lists after all!
column_lists = ["S_RAUM","S_JAHR","S_RAUMEBENE"]
koeln_stats[column_lists]
We can also use df.loc[rows,columns]
to index into our dataframe. .loc
is used to index into the dataframe based on labels. Since row labels are usually just numbers, we can pass an integer here.
To get the column, we need to pass the column name as a string.
koeln_stats.loc[0,"A0002P"]
17.271948
Exercise#
Show every second row of the first 16 rows of the koeln_stats
dataframe for the column S_RAUM
.
Hint: Use .loc
indexing. You can pass an integer to the row selection and “S_RAUM” to the column selection df.loc[row,column]
. Use slicing to get every 2nd row.
The general syntax for slicing is
[start:stop:step] -> [index where the slicing starts : index where the slicing stops : Interval between slices]
koeln_stats.loc[:16:2,"S_RAUM"]
To use integers for indexing, we need to use df.iloc
. Now, we can simply pass integer values for both rows and columns.
df.iloc[0,1] -> First row and second column of the dataframe
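A minimal side-by-side of .iloc and .loc on a tiny made-up dataframe:

```python
import pandas as pd

# A tiny made-up dataframe, purely for illustration
df = pd.DataFrame({"a": [10, 20, 30], "b": [40, 50, 60]})

print(df.iloc[0, 1])    # position based: row 0, column 1 -> 40
print(df.loc[0, "b"])   # label based: row label 0, column label "b" -> 40

# Positional slices are end-exclusive, just like list slicing
print(df.iloc[0:2, 0].tolist())  # [10, 20]
```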
Exercise#
Define a random integer called ran_col
using the numpy.random.randint
function. It should not be larger than the number of columns in koeln_stats
.
Define a second random integer called ran_rows
using the same function. Make sure it's not larger than the number of rows in koeln_stats
.
Use these two integers to index into the dataframe using .iloc
# Define random column and row indices
ran_col = np.random.randint(0, koeln_stats.shape[1]) # Random column index (0 to number of columns - 1)
ran_row = np.random.randint(0, koeln_stats.shape[0]) # Random row index (0 to number of rows - 1)
# Use .iloc to index into the dataframe
random_value = koeln_stats.iloc[ran_row, ran_col]
We can also create a mask that is based on boolean values and use it to extract parts of the dataframe, where a given condition is True
.
The column
S_JAHR
stores the year in which each statistic was collected.
mask = koeln_stats["S_JAHR"] == 2012
koeln_stats[mask]
This is equivalent to
koeln_stats[koeln_stats.S_JAHR == 2012]
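A minimal sketch of boolean masking, using a toy dataframe in place of koeln_stats:

```python
import pandas as pd

# Toy dataframe reusing the S_JAHR column name from the tutorial
stats = pd.DataFrame({"S_JAHR": [2011, 2012, 2012, 2013],
                      "value":  [1.0, 2.0, 3.0, 4.0]})

mask = stats["S_JAHR"] == 2012  # boolean Series: False, True, True, False
subset = stats[mask]            # keeps only the rows where the mask is True
print(subset)
```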
Exercise#
Create a new variable called my_vedel
. Assign it a string with the name of the vedel you live in (if you are comfortable with that; otherwise just use any other).
We now want to extract the data that belongs to my_vedel
.
Since this information is stored in the RAUM
column, we need to build the mask based on that column.
Unfortunately, a simple equality comparison will not match here, because the vedel name may appear only as part of a longer string. So we need to use df.column.str.contains(str).
Create a variable called mask
. Use the syntax
df.column.str.contains(str)
from above.
Hint: Replace df
with the actual name of the dataframe,
column
with the RAUM column, and
str
with the name of your vedel (as a string variable!)
Create a new variable named after your vedel. This variable should contain only the rows for your vedel!
For example, you want something like
mask = ....
ehrenfeld = koeln_stats[mask]
my_vedel = "Ehrenfeld"
koeln_stats[koeln_stats.RAUM.str.contains(my_vedel)]
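A minimal sketch of substring masking with str.contains, using a toy dataframe with made-up district entries:

```python
import pandas as pd

# Toy dataframe with a RAUM-like string column (entries are invented)
stats = pd.DataFrame({"RAUM":  ["Ehrenfeld", "Nippes", "Köln-Ehrenfeld"],
                      "value": [1, 2, 3]})

my_vedel = "Ehrenfeld"
mask = stats.RAUM.str.contains(my_vedel)  # True wherever the substring occurs
ehrenfeld = stats[mask]                   # matches plain and compound names
print(ehrenfeld)
```

Note that the substring match picks up both "Ehrenfeld" and "Köln-Ehrenfeld", which is exactly why str.contains is preferred over a strict equality comparison here.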
Exercise#
Using the koeln_stats_description
dataframe, look up a column code that you think is interesting. Access that column in your vedel
or the koeln_stats
dataframe.
Hint: Use koeln_stats_description.head(n=50) to see more column names stored in there.
Hint 2.0: Use this code
koeln_stats_description[["SCHLUESSEL","INHALT"]].head(n=50).iloc[num_row,1]
to get specific information about a given row!
filtered = koeln_stats[["S_JAHR","S_RAUM","RAUM","S_RAUMEBENE","RAUMEBENE","A0275A","A0315A"]]
filtered.groupby(["RAUM","S_JAHR"])["A0275A"].mean()
With that column in mind you can do some interesting investigation. Maybe you want to know whether the number of people above 80 years differs between vedels?
To get an idea of the mean distribution
you can use the following syntax
df.groupby("ColumnToGroupBy")["OutcomeColumn"].mean()
Try this now with your dataframe.
Hint: If you want to group by multiple columns, you must pass them as a list to groupby.
df.groupby(["Column1","Column2"])["Outcome"].mean()
filtered.groupby(["RAUM","S_JAHR"])["A0275A"].mean()
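To make the groupby pattern concrete, here is a minimal sketch with a toy dataframe that mimics the column structure (the values are invented):

```python
import pandas as pd

# Toy data mimicking the RAUM / S_JAHR / A0275A structure
df = pd.DataFrame({
    "RAUM":   ["A", "A", "B", "B"],
    "S_JAHR": [2012, 2012, 2012, 2013],
    "A0275A": [10.0, 20.0, 30.0, 40.0],
})

# Mean of the outcome column per (RAUM, S_JAHR) group
means = df.groupby(["RAUM", "S_JAHR"])["A0275A"].mean()
print(means)
```

The result is a Series indexed by (RAUM, S_JAHR) pairs, so each group's mean can be looked up directly.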
We can also compute distribution statistics (mean, standard deviation) directly from a column
koeln_stats["A0002P"].mean(),koeln_stats["A0002P"].std()
(18.953883179111735, 10.583493602915325)
Data Visualization and Plotting#
Now that we’ve taken a closer look at our data through basic descriptive statistics and data types, we’ll take the next step by exploring it visually. Basic data visualization offers a different perspective and can reveal key patterns or issues relevant for further analysis. In the following steps, we’ll use various Python libraries and functions to create visualizations that highlight different aspects of the data.
What is Plotting?
Plotting in data science and programming refers to the visual representation of data using charts or graphs. It helps us understand patterns, relationships, and trends in data more clearly and efficiently than raw numbers alone. By turning data into visual formats, such as line graphs, bar charts, histograms, or scatter plots, we can make more informed decisions, identify outliers, and communicate insights to others.
There are two very popular libraries in Python that are almost always used for visualizing data.
That is seaborn
and matplotlib
.
Matplotlib's interface is modeled after MATLAB's plotting commands, and it is the most widely used plotting library in Python.
Seaborn is built on top of matplotlib
and is probably the second most popular library for visualizing data in Python.
Once more, these libraries have their own commonly used abbreviations. You will usually import them like this
import matplotlib.pyplot as plt
import seaborn as sns
sns.displot(data=dataframe, x="Time_to_prepare_sec",hue="Size",kind="kde",multiple="stack")
We can also combine different plots into one figure
fig,axes=plt.subplots(1,3,sharey=True,sharex=True)
sns.lineplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",errorbar=None,ax=axes[0],palette="deep")
sns.barplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",errorbar=None,ax=axes[1],palette="deep")
sns.violinplot(data=dataframe, y="Time_to_prepare_sec", x = "Drink",ax=axes[2])
for ax in axes:
ax.tick_params(axis='x', rotation=90)
plt.tight_layout()
plt.show()
We can also create an interaction plot. This plot might be useful if we assume a difference between our independent variables. For example, does the size of the drink influence the time to prepare it?
We indicate this with the hue
parameter.
sns.lineplot(data=dataframe, y="Time_to_prepare_sec", x="Drink",hue="Size",errorbar=None,palette="deep")
plt.title("Time to Prepare Different Drinks by Size", fontsize=14)
plt.xlabel("Drink Type", fontsize=12)
plt.ylabel("Preparation Time (sec)", fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels
plt.grid(True, linestyle="--", alpha=0.5)
plt.tight_layout()
plt.legend(title="Drink Size")
There are, of course, many more plot types. You can find all of them in the seaborn example gallery.
Anyway, after this short tutorial on plotting and data visualization, let's return to our koeln_stats dataframe.
Exercise#
Visualize the column you picked from earlier, using the seaborn library. Add xlabels, ylabels and a title to your plot. If you want to, you can choose multiple columns and plot something interactive, using the hue
parameter.
For example, you might wonder whether the number of people above 80 years old increased over the years in a specific vedel.
sns.lineplot(data=filtered,x="S_JAHR",y="A0275A")
plt.title("Number of three-child households in Köln over the years", fontsize=14)
plt.xlabel("Year", fontsize=12)
plt.ylabel("Number of households", fontsize=12)
plt.xticks(rotation=45) # Rotate x-axis labels
plt.grid(True, linestyle="--", alpha=0.5)
plt.tight_layout()