Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. "the ability to process data depends on users capabilities and their tools"
    • Combining Data. "combine county data sets"
    • Status on Data"determining the artist with the greatest attendance during a particular month"
    • Data poses challenge. "the need to clean data", "incomplete data"
  • From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

DataFrame

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned

df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75

Extracting Info

Take a look at some features that the Pandas library has that extracts info from the dataset

DataFrame Extract Column

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75

DataFrame Sort

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
3         469             11  3.45
10        469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
3         469             11  3.45
10        469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75

DataFrame Selection or Filter

print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
dict = {
  "calories": [420, 380, 390, 50000],
  "duration": [50, 40, 45, 200]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(dict)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(dict, index = ["day1", "day2", "day3", "day4"])
print(df)
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
3     50000       200
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45
day4     50000       200

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day4"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day4"])
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day4     50000       200
--------Examine Single Row-----------
calories    50000
duration      200
Name: day4, dtype: int64

Pandas DataFrame Information

print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, day1 to day3
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  3 non-null      int64
 1   duration  3 non-null      int64
dtypes: int64(2)
memory usage: 180.0+ bytes
None

Example of larger data set

Pandas can read CSV and many other types of files, run the following code to see more features with a larger data set

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))
--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
109       210    137       184    1860.4
60        210    108       160    1376.0
106       180     90       120     800.3
90        180    101       127     600.1
65        180     90       130     800.4
61        160    110       137    1034.4
62        160    109       135     853.0
67        150    107       130     816.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
68         20    106       136     110.4
100        20     95       112      77.7
89         20     83       107      50.3
135        20    136       156     189.0
94         20    150       171     127.4
95         20    151       168     229.4
139        20    141       162     222.4
64         20    110       130     131.4
112        15    124       139     124.2
93         15     80       100      50.5

APIs are a Source for Writing Programs with Data

3rd Party APIs are a great source for creating Pandas Data Frames.

  • Data can be fetched and resulting json can be placed into a Data Frame
  • Observe output, this looks very similar to a Database
'''Pandas can be used to analyze data'''
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
    fetch = requests.get(url)
    json = fetch.json()

    # filter data for requirement
    df = pd.DataFrame(json['countries_stat'])  # filter endpoint for country stats
    print(df.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths
    
fetch()
  country_name       cases     deaths
0          USA  82,649,779  1,018,316
1        India  43,057,545    522,193
2       Brazil  30,345,654    662,663
3       France  28,244,977    145,020
4      Germany  24,109,433    134,624
5           UK  21,933,206    173,352

Hacks

Early Seed award

  • Add this Blog to you own Blogging site.
  • Have all lecture files saved to your files directory before Tech Talk starts. Have data.csv open in vscode. Don't tell anyone. Show to Teacher.

AP Prep

  • Add this Blog to you own Blogging site. In the Blog add notes and observations on each code cell.
  • In blog add College Board practice problems for 2.3.

The next 4 weeks, Teachers want you to improve your understanding of data. Look at the blog and others on Unit 2. Your intention is to find some things to differentiate your individual College Board project.

  • Create or Find your own dataset. The suggestion is to use a JSON file, integrating with your PBL project would be Amazing.

  • When choosing a data set, think about the following...

    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?
    • ...
  • Continue this Blog using Pandas extract info from that dataset (ex. max, min, mean, median, mode, etc.)

Hack Helpers

Here is how Mort started on this assignment by asking ChatGPT ... Regarding Python Pandas, what are some data sets that would be good for learning Pandas?

  • There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop.

    Here are some suggestions...

    • Titanic Dataset:This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.
    • Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.

    • NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.

    • Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.

    • COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.

    • World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.

  • These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.

    Follow up question, "where can I find Titanic data set?"

  • The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset...

    • Kaggle:The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.
    • UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.

    • Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:

      import seaborn as sns
      titanic_data = sns.load_dataset('titanic')
      

Hacks Collegeboard

I got One wrong question in the Extracting Information from Data Quiz:

A is correct because if you divide the total dollar amount of tickets sold by the average ticket price, you can get the attendance for the concert.

Hacks Code(Lottery Data Set)

import pandas as pd

#read csv and sort 'Multiplier' largest to smallest
data = pd.read_csv('files/Lottery_Powerball_Winning_Numbers__Beginning_2010.csv').sort_values(by=['Multiplier'], ascending=False)

# Clean all the values with no multipliers
data.dropna(inplace = True)

print("--10 Largest multipliers---------")
print(data.head(10))

print("--10 Smallest multipliers------")
print(data.tail(10))
--10 Largest multipliers---------
       Draw Date    Winning Numbers  Multiplier
1261  01/08/2022  20 21 36 60 65 13        10.0
1132  06/12/2010  09 12 13 35 38 30        10.0
42    02/20/2021  04 08 22 32 58 04        10.0
432   03/01/2017  10 16 40 52 55 17        10.0
173   08/28/2019  09 32 37 41 56 14        10.0
1396  11/26/2022  15 30 47 50 51 03        10.0
1404  12/14/2022  36 51 59 66 68 25        10.0
71    08/19/2020  13 23 47 55 58 23        10.0
337   01/31/2018  04 07 14 46 59 22        10.0
171   09/04/2019  04 08 30 52 59 02        10.0
--10 Smallest multipliers------
      Draw Date    Winning Numbers  Multiplier
733  04/09/2014  09 14 44 48 49 29         2.0
309  05/09/2018  11 16 38 50 69 19         2.0
735  04/02/2014  08 13 19 22 53 24         2.0
736  03/29/2014  02 03 12 27 38 17         2.0
737  03/26/2014  28 33 41 44 59 21         2.0
738  03/22/2014  13 28 31 55 58 15         2.0
739  03/19/2014  02 19 23 34 43 14         2.0
741  03/12/2014  14 15 28 37 54 10         2.0
742  03/08/2014  10 14 24 32 41 30         2.0
615  05/30/2015  08 09 25 56 57 22         2.0

Find Max and Min:

import pandas as pd

#read csv and sort 'Multiplier' largest to smallest
data = pd.read_csv('files/Lottery_Powerball_Winning_Numbers__Beginning_2010.csv').sort_values(by=['Multiplier'], ascending=False)

# Clean all the values with no multipliers
data.dropna(inplace = True)

print("--Largest multiplier---------")
print(data.head(1))

print("--Smallest multiplier------")
print(data.tail(1))
--Largest multiplier---------
       Draw Date    Winning Numbers  Multiplier
1261  01/08/2022  20 21 36 60 65 13        10.0
--Smallest multiplier------
      Draw Date    Winning Numbers  Multiplier
615  05/30/2015  08 09 25 56 57 22         2.0

Find a Specific Date:

import pandas as pd

#read csv and sort 'Duration' largest to smallest
data = pd.read_csv('files/Lottery_Powerball_Winning_Numbers__Beginning_2010.csv').sort_values(by=['Multiplier'], ascending=False)

for i in data.index:
  if data.loc[i, "Multiplier"] >= 5:
    data.drop(i, inplace = True)

print("Largest multipliers below 5:")
print(data.head(10))

for j in data.index:
  if data.loc[j, "Draw Date"] == '12/25/2021':
    christmasWin = data.loc[j]
    
print('')
print("Lottery win on Christmas in 2021:")
print(christmasWin)
Largest multipliers below 5:
       Draw Date    Winning Numbers  Multiplier
425   03/25/2017  18 31 32 45 48 16         4.0
1050  03/26/2011  04 10 11 19 33 27         4.0
1049  03/30/2011  19 20 42 56 58 37         4.0
1046  04/09/2011  05 14 32 53 56 11         4.0
234   01/26/2019  08 12 20 21 32 10         4.0
1374  10/03/2022  02 16 22 55 63 22         4.0
1097  10/13/2010  12 22 32 34 46 02         4.0
1039  05/04/2011  03 15 27 29 41 24         4.0
1054  03/12/2011  01 04 12 41 47 03         4.0
1038  05/07/2011  02 11 27 47 55 15         4.0

Lottery win on Christmas in 2021:
Draw Date                 12/25/2021
Winning Numbers    27 29 45 55 58 02
Multiplier                       3.0
Name: 1255, dtype: object

Find the Average:

import pandas as pd

#read csv
data = pd.read_csv('files/Lottery_Powerball_Winning_Numbers__Beginning_2010.csv')
# Delete rows with no multiplier in it
data.dropna(inplace = True)

# Set up temporary counter and total sum
number = 0
sum = 0

# Iterate through the entire data set
for i in data.index:
  number = data.loc[i, "Multiplier"]
  # Each multiplier is added to the sum
  sum = sum + number

# Sum is divided by 1441(total rows of data) to get the mean(this means that rows with N/A multipliers are automatically counted as 0)
mean = sum/1441

# Print the average
print("Average multiplier values from 2010 to 2023:")
print(mean)
Average multiplier values from 2010 to 2023:
2.4205412907702986

Attempted to Do the Median:

import pandas as pd

data = pd.read_csv('files/Lottery_Powerball_Winning_Numbers__Beginning_2010.csv').sort_values(by=['Multiplier'], ascending=False)

# Find center values of the data(Problem: This data is not sorted)
num1 = data.loc[720, "Multiplier"]
num2 = data.loc[721, "Multiplier"]
median = (num1 + num2)/2
# Display the median
print("Median of the multipliers")
print(median)
print("Data sets in the middle:")
print(num1)
print(num2)
print(data)
Median of the multipliers
3.5
Data sets in the middle:
2.0
5.0
       Draw Date    Winning Numbers  Multiplier
1261  01/08/2022  20 21 36 60 65 13        10.0
1132  06/12/2010  09 12 13 35 38 30        10.0
42    02/20/2021  04 08 22 32 58 04        10.0
432   03/01/2017  10 16 40 52 55 17        10.0
173   08/28/2019  09 32 37 41 56 14        10.0
...          ...                ...         ...
961   02/01/2012  08 13 17 34 59 35         NaN
962   01/28/2012  05 33 41 54 59 13         NaN
963   01/25/2012  04 19 28 29 47 05         NaN
964   01/21/2012  12 24 43 44 45 07         NaN
965   01/18/2012  06 29 34 44 50 28         NaN

[1441 rows x 3 columns]

Machine Learning Visit Tutorials Point

Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistence interface in Python.

  • Description from ChatGPT. The Titanic dataset is a popular dataset for data analysis and machine learning. In the context of machine learning, accuracy refers to the percentage of correctly classified instances in a set of predictions. In this case, the testing data is a subset of the original Titanic dataset that the decision tree model has not seen during training......After training the decision tree model on the training data, we can evaluate its performance on the testing data by making predictions on the testing data and comparing them to the actual outcomes. The accuracy of the decision tree classifier on the testing data tells us how well the model generalizes to new data that it hasn't seen before......For example, if the accuracy of the decision tree classifier on the testing data is 0.8 (or 80%), this means that 80% of the predictions made by the model on the testing data were correct....Chance of survival could be done using various machine learning techniques, including decision trees, logistic regression, or support vector machines, among others.

  • Code Below prepares data for further analysis and provides an Accuracy. IMO, you would insert a new passenger and predict survival. Datasets could be used on various factors like prediction if a player will hit a Home Run, or a Stock will go up or down.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split arrays or matrices into random train and test subsets.
X = td.drop('survived', axis=1)
y = td['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy:', accuracy)

# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy:', accuracy)
DecisionTreeClassifier Accuracy: 0.7705882352941177
LogisticRegression Accuracy: 0.788235294117647
/Users/johnmortensen/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(