Unit 2.3 Extracting Information from Data, Pandas
Data connections, trends, and correlation. Pandas is introduced as it could be valuable for PBL, data validation, as well as understanding College Board Topics.
- Files To Get
- Pandas and DataFrames
- Cleaning Data
- Extracting Info
- Create your own DataFrame
- Example of larger data set
- APIs are a Source for Writing Programs with Data
- Hacks
- Hack Helpers
- Notes condensed
- Practice Problems
- 1Q: A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.
- 2Q: A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?
- 3Q: A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?
- 4Q:A database of information about shows at a concert venue contains the following information.
- 5Q: A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?
- 6Q: A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.
- Titanic Data
- Machine Learning
Files To Get
Save this file to your _notebooks folder
Save these files into a subfolder named files in your _notebooks folder
wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/files/data.csv
wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/files/grade.json
Save this image into a subfolder named images in your _notebooks folder
wget https://raw.githubusercontent.com/nighthawkcoders/APCSP/master/_notebooks/images/table_dataframe.png
Pandas and DataFrames
In this lesson we will be exploring data analysis using Pandas.
- College Board talks about ideas like
- Tools. "the ability to process data depends on users capabilities and their tools"
- Combining Data. "combine county data sets"
- Status on Data"determining the artist with the greatest attendance during a particular month"
- Data poses challenge. "the need to clean data", "incomplete data"
- From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd
import json
with open('users.json') as f:
data = json.load(f)
username = input("write your username")
print(username)
password = input("write your password")
print(password)
confirm_password = input("confirm the password")
print("confirm_password is ", confirm_password)
a_dict = {'username': username, 'password': password}
i = 0
if password == confirm_password:
while True:
if i >= len(data):
data.append(a_dict)
with open('users.json', 'w') as f:
json.dump(data, f)
print("register success")
break
if username == data[i]["username"]:
print("username already existed")
break
else:
i += 1
else:
print("password is not matched")
import json
login_user = str(input("what is your username"))
print(login_user)
login_password = str(input("what is password"))
print(login_password)
i = 0
login_users = pd.read_json('users.json')
while True:
if i >= len(login_users):
print("password or username is not matched")
break
if (login_password == login_users["password"][i]) & (login_user == login_users["username"][i]):
print("Welcome", login_user)
break
else:
i += 1
df = pd.read_json('files/grade.json')
print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?
print(df[['GPA']])
print()
#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
print(df.sort_values(by=['GPA']))
print()
#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
print(df[df.GPA > 3.00])
print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
import pandas as pd
#the data can be stored as a python dictionary
dict = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df_csv = pd.DataFrame(dict)
print(df_csv)
print("----------Dict_to_DF_labels--------------")
#or with the index argument, you can label rows.
df_day = pd.DataFrame(dict, index = ["day1", "day2", "day3"])
print(df_day)
print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df_day.loc[["day1", "day3"]])
#refer to the row index:
print("--------Examine Single Row-----------")
print(df_day.loc["day1"])
print(df_day.info())
import pandas as pd
#read csv and sort 'Duration' largest to smallest
data_duration = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)
print("--Duration Top 10---------")
print(data_duration.head(10))
print("--Duration Bottom 10------")
print(data_duration.tail(10))
'''Pandas can be used to analyze data'''
import pandas as pd
import requests
def fetch_covid():
'''Obtain data from an endpoint'''
url_covid = "https://flask.nighthawkcodingsociety.com/api/covid/"
fetch_covid = requests.get(url_covid)
json_covid = fetch_covid.json()
# filter data for requirement
covid = pd.DataFrame(json_covid['countries_stat']) # filter endpoint for country stats
print(covid.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths
fetch_covid()
import pandas as pd
import requests
def fetch_users():
'''Obtain data from an endpoint'''
url = "https://dolphin.nighthawkcodingsociety.com/api/users/"
fetch_user = requests.get(url)
json_user = fetch_user.json()
# filter data for requirement
df_user = pd.DataFrame(json_user)
# print(df)
print(df_user.loc[0:5, ['username', 'password', 'age']])
fetch_users()
import pandas as pd
import requests
def fetch_login():
'''Obtain data from an endpoint'''
url = "https://dolphin.nighthawkcodingsociety.com/api/users/"
fetch_user = requests.get(url)
json_user = fetch_user.json()
username = input("write your username")
print(username)
password = input("write your password")
print(password)
i = 0
while True:
if i >= len(data):
print("password or username is not matched")
break
if (username == json_user[i]['username']) & (password == json_user[i]['password']):
print("Welcome", username)
break
else:
i += 1
fetch_login()
Hacks
AP Prep
- Add this Blog to you own Blogging site. In the Blog add notes and observations on each code cell.
- In blog add College Board practice problems for 2.3.
The next 4 weeks, Teachers want you to improve your understanding of data. Look at the blog and others on Unit 2. Your intention is to find some things to differentiate your individual College Board project.
- Create or Find your own dataset. The suggestion is to use a JSON file, integrating with your PBL project would be Fambulous.
When choosing a data set, think about the following:- Does it have a good sample size? - Is there bias in the data?
- Does the data set need to be cleaned?
- What is the purpose of the data set?
- ...
- Continue this Blog using Pandas extract info from that dataset (ex. max, min, mean, median, mode, etc.)
Hack Helpers
Here is how Mort started on this assignment by asking ChatGPT ... Regarding Python Pandas, what are some data sets that would be good for learning Pandas?
-
There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop. Here are some suggestions:- Titanic Dataset: This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.
-
Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.
-
NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.
-
Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.
-
COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.
-
World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.
-
-
These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.
Follow up question, "where can I find Titanic data set?"
-
The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset:- Kaggle: The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.
-
UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.
-
Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:
import seaborn as sns titanic_data = sns.load_dataset('titanic')
-
Notes condensed
- Read just produces all the data
- Pandas Can sort items inside the json list
- Can produce the min and max from the json
- You can choose only certin keys to produce certin table
- Panda can read multible different data formats and still put them into tables
- The pd.DataFrame makes the data look like its in tables using a json file
Practice Problems
1Q: A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.
The researcher has access to a database with the following information about each student.
Last name
First name
Grade level (9, 10, 11, or 12)
Grade point average (on a 0.0 to 4.0 scale)
The researcher also has access to another database with the following information about each student.
First name
Last name
Number of absences from school
Number of late arrivals to school
Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?
1A: Students who have the same name may be confused with each other.
2Q: A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?
(A) A computer program cannot combine data from different files.
(B) Different counties may organize data in different ways.
(C) The number of counties is too large for the program to process.
(D) The total number of rows of data is too large for the program to process.
2A: B
3Q: A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?
Select two answers.
(A) Users might attempt to use the Web site to search for multiple cities.
(B) Users might enter abbreviations for the names of cities.
(C) Users might misspell the name of the city.
(D) Users might be slow at typing a city name in the text field.
3A: B and C
4Q:A database of information about shows at a concert venue contains the following information.
Name of artist performing at the show
Date of show
Total dollar amount of all tickets sold
Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?
(A) Average ticket price
(B) Length of the show in minutes
(C) Start time of the show
(D) Total dollar amount of food and drinks sold during the show
4A: A
5Q: A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?
(A) The average number of hours per day that the car is in use
(B) The car’s average speed on a particular day
(C) The distance the car traveled on a particular day
(D) The number of bicycles the car passed on a particular day
5A: D
6Q: A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.
On average, how long does homework take you each night (in minutes)?
On average, how long do you study for each test (in minutes)?
Do you enjoy the subject material of this class (yes or no)?
Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?
I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?
II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?
III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?
(A) I only
(B) III only
(C) I and II
(D) I and III
6A: C
import seaborn as sns
# Load the titanic dataset
titanic_data = sns.load_dataset('titanic')
print("Titanic Data")
print(titanic_data.columns) # titanic data set
print(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town']]) # look at selected columns
Use Pandas to clean the data. Most analysis, like Machine Learning or even Pandas in general like data to be in standardized format. This is called 'Training' or 'Cleaning' data.
# Preprocess the data
from sklearn.preprocessing import OneHotEncoder
td = titanic_data
td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
td.dropna(inplace=True)
td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0)
td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0)
# Encode categorical variables
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(td[['embarked']])
onehot = enc.transform(td[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
td[cols] = pd.DataFrame(onehot)
td.drop(['embarked'], axis=1, inplace=True)
td.dropna(inplace=True)
print(td)
The result of 'Training' data is making it easier to analyze or make conclusions. In looking at the Titanic, as you clean you would probably want to make assumptions on likely chance of survival.
This would involve analyzing various factors (such as age, gender, class, etc.) that may have affected a person's chances of survival, and using that information to make predictions about whether an individual would have survived or not.
-
Data description:- Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file. - Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name - Name
- Sex - Sex
- Age - Age
- Sibsp - Number of Siblings/Spouses Aboard
- Parch - Number of Parents/Children Aboard
- Ticket - Ticket Number
- Fare - Passenger Fare
- Cabin - Cabin
- Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
-
Perished Mean/Average
print(titanic_data.query("survived == 0").mean())
- Survived Mean/Average
print(td.query("survived == 1").mean())
Survived Max and Min Stats
print(td.query("survived == 1").max())
print(td.query("survived == 1").min())
Machine Learning
From Tutorials Point%20is,a%20consistence%20interface%20in%20Python). Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction via a consistence interface in Python.> Description from ChatGPT. The Titanic dataset is a popular dataset for data analysis and machine learning. In the context of machine learning, accuracy refers to the percentage of correctly classified instances in a set of predictions. In this case, the testing data is a subset of the original Titanic dataset that the decision tree model has not seen during training......After training the decision tree model on the training data, we can evaluate its performance on the testing data by making predictions on the testing data and comparing them to the actual outcomes. The accuracy of the decision tree classifier on the testing data tells us how well the model generalizes to new data that it hasn't seen before......For example, if the accuracy of the decision tree classifier on the testing data is 0.8 (or 80%), this means that 80% of the predictions made by the model on the testing data were correct....Chance of survival could be done using various machine learning techniques, including decision trees, logistic regression, or support vector machines, among others.
- Code Below prepares data for further analysis and provides an Accuracy. IMO, you would insert a new passenger and predict survival. Datasets could be used on various factors like prediction if a player will hit a Home Run, or a Stock will go up or down.
- Decision Trees, prediction by a piecewise constant approximation. - Logistic Regression, the probabilities describing the possible outcomes.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Split arrays or matrices into random train and test subsets.
X = td.drop('survived', axis=1)
y = td['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy:', accuracy)
# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy:', accuracy)