8 Machine Learning Projects for High School Students
You’ve probably been tuned in to the chatter about the new capabilities – good and bad – of AI. The initial plunge into such a vast field with so much critique can be intimidating. One of the main things beginners overestimate is the amount of background they actually need to get started. Truthfully, you’ll only need some fundamental Python knowledge and some idea of how to use common libraries like pandas, NumPy, and scikit-learn (sklearn). If you haven’t heard of some (or any) of these, don’t worry. They’re designed to be easy to work with, and there are plenty of tutorials online to get you started.
Practice is the best way to learn the workflow and gain experience in AI. Understanding when and how to use certain models and algorithms is a critical skill you’ll need as you continue your interest in this field. Feeling overwhelmed about where to start? Don’t worry, we’ve got you! Here’s a list of the most common and flexible projects that teach you the fundamentals and then some.
1. The Titanic
The Titanic Survival competition on Kaggle is a staple for people who are new to the platform and to data science as a whole. Don’t let “competition” scare you: it’s just how Kaggle hosts this data, and it gets you accustomed to the platform in case you decide to enter the real competitions (they offer some prizes and, at the very least, experience!). The task for this project is to create a model that can accurately predict which passengers survived the shipwreck. Because the outcome is one of two categories (survived or not), this is a classification task!
The dataset has a mix of categorical and numerical columns, so you get practice working with both of these critical types of data. Simple models should do reasonably well, so no need to worry about having memorized every tool for the job. Even so, there are a lot of ways to get creative and learn as you go!
An added benefit of this project is getting accustomed to web-hosted data and to the separation of datasets into train and test sets. If you’re confused about anything, there are plenty of videos and resources walking you through the process.
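To make that workflow concrete, here’s a minimal sketch of a Titanic-style classifier. Everything in it is an assumption for illustration: the data is randomly generated rather than loaded from the real Kaggle file, the column names (Sex, Pclass, Age, Survived) are stand-ins, and the survival rule is made up so the model has something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the passenger table (NOT the real Kaggle data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Pclass": rng.integers(1, 4, size=n),   # ticket class 1, 2, or 3
    "Age": rng.uniform(1, 70, size=n),
})
# Blank out some ages so there is something to impute.
df.loc[rng.choice(n, size=20, replace=False), "Age"] = np.nan
# Made-up rule used only to generate labels for this sketch.
df["Survived"] = ((df["Sex"] == "female") | (df["Pclass"] == 1)).astype(int)

X = df[["Sex", "Pclass", "Age"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# One-hot encode the categorical columns, fill missing ages with the median.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Pclass"]),
    ("num", SimpleImputer(strategy="median"), ["Age"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Accuracy on rows the model never saw during training.
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

With the real competition files you would swap the synthetic table for the downloaded training data and keep the same split, preprocess, fit, and score steps.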
2. Wine Quality
Another great dataset is Wine Quality. This isn’t a competition, but you can also get this dataset on Kaggle.
Don’t worry, you don’t actually have to know anything about wine! This dataset has a good number of columns to work with, all based on chemical properties. This isn’t why we love it, though. All the data is numerical (with the exception of the type of wine: white or red) and none of it is missing.
While this one is technically classification in the sense that the “quality” target is ordinal (higher integers mean better quality), you could also treat it as regression. This gives you the creative freedom to experiment and gain some intuition about why we sometimes choose classification over regression even when the target is a number.
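One way to build that intuition is to fit both kinds of model on the same ordinal target and compare them. The sketch below is only an illustration under assumptions: the features and the integer “quality” score are randomly generated, not taken from the Wine Quality dataset, and the two linear models are just convenient examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic features and an integer quality score (NOT the real wine data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
raw_score = 6 + 1.2 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=500)
quality = np.clip(np.round(raw_score), 3, 9).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, quality, test_size=0.2, random_state=42
)

# Classification: each quality level is treated as a separate label.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
class_pred = clf.predict(X_test)

# Regression: quality is treated as a number, so predictions can be fractional.
reg = LinearRegression().fit(X_train, y_train)
reg_pred = reg.predict(X_test)

# Compare both models on the same two metrics.
class_acc = accuracy_score(y_test, class_pred)
reg_acc = accuracy_score(y_test, np.round(reg_pred).astype(int))
class_mae = mean_absolute_error(y_test, class_pred)
reg_mae = mean_absolute_error(y_test, reg_pred)

print(f"classifier: accuracy={class_acc:.2f}, MAE={class_mae:.2f}")
print(f"regressor:  accuracy={reg_acc:.2f}, MAE={reg_mae:.2f}")
```

Accuracy only rewards exact matches, while mean absolute error also measures how far off a wrong guess is, so looking at both shows what each framing gains or gives up.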
Plotting is really nice for this one too. Box and whisker plots, scatter plots, correlation heatmaps, and line plots are all great when we have a lot of good quality data that’s also numerical. You can flex and strengthen your Seaborn and Matplotlib skills while you’re at it!
This project is also flexible in terms of modeling. You can predict the quality of the wine (which is technically the target), but also try to explore predicting the pH, the type of wine (white or red) and more.
3. Predict Credit Loan Default
Another staple for high school students, undergrads, and graduates alike is the Credit Loan Prediction project. You'll be using a variety of techniques like data visualization and feature selection. Like the previous few, this is a classification task. You’ll be predicting (classifying) whether a loan is accepted or rejected.
This project is also a good place to start thinking about ethics and legality. Let me give you an example: did you know it’s illegal to use gender information to inform loan or credit approvals in the United States? Suppose you didn’t know that and during your modeling you found gender was a really important factor. What happens if you use it and hand your super good model off to an executive? You guessed it: right to jail! I’m kidding…sort of.
Point is, ethics and legality in data science are very important to consider at all times for your safety and the safety and equity of others. Even if you aren’t breaking the law, your model could be using factors that cause a disparate impact on certain populations.
One big example is predictive policing, which was found to unfairly target minorities. Our goal is to avoid that behavior at all costs, even if it seemingly causes our model to do “worse”. This project will get you thinking about ethics early.
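Here’s a rough sketch of what acting on that can look like in code: leave the protected column out of the features, then still compare the model’s approval rates across groups, since other columns can act as a proxy for it. The table, the column names (gender, income, debt_ratio, approved), and the approval rule are all invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic loan applications (NOT a real credit dataset).
rng = np.random.default_rng(1)
n = 400
loans = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "income": rng.normal(50, 15, size=n),       # in thousands, made up
    "debt_ratio": rng.uniform(0, 1, size=n),
})
# Made-up approval rule that does not look at gender.
noise = rng.normal(0, 0.2, size=n)
loans["approved"] = (
    (loans["income"] / 100 - loans["debt_ratio"] + noise) > 0
).astype(int)

# Keep protected attributes out of the model's inputs.
protected = ["gender"]
features = loans.drop(columns=protected + ["approved"])

model = LogisticRegression(max_iter=1000).fit(features, loans["approved"])
# Predictions on the training rows, used here only for the fairness check.
loans["predicted"] = model.predict(features)

# Approval rate per group, and the ratio of the lowest rate to the highest.
approval_rates = loans.groupby("gender")["predicted"].mean()
rate_ratio = approval_rates.min() / approval_rates.max()

print(approval_rates)
print(f"Lowest-to-highest approval rate ratio: {rate_ratio:.2f}")
```

A ratio far below 1 would be a signal to dig into which features are driving the gap, even though gender itself was never given to the model.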
4. Malaria Cell Recognition
One of the most common starter projects for Computer Vision focuses on detecting Malaria infection in cells. This data is everywhere: in Python libraries like TensorFlow and posted on various websites and repositories. This is also a classification task, where the goal is to establish whether a cell is infected or not. The difference with this one is the kind of information you get. You’ll be looking at images, literal pixels coded with different variations of red, green, and blue to make up an image.
While jumping into Computer Vision might seem scary, I promise it makes more sense with practice. Further, this project is meant to be meaningful but also accommodate any skill level. You’ll need to know what a Convolutional Neural Network is at a high level and how to build a basic one with TensorFlow or PyTorch, but that’s it. You can make it as easy or difficult as you want!
Malaria is still a huge problem around the globe, and has recently been in the news for re-emerging in the southern US. Dipping your toes into the world of image recognition opens doors for other projects in the medical field, like assisted X-ray and ultrasound readings and more.
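If you’re wondering what “a basic one” looks like, here’s a minimal PyTorch sketch of a small Convolutional Neural Network for a two-class image problem. The layer sizes, the 64x64 input size, and the class name SmallCNN are assumptions picked for illustration; real cell images would need to be resized to match, and the random batch below only checks that the shapes line up.

```python
import torch
from torch import nn


class SmallCNN(nn.Module):
    """Two convolution blocks followed by one linear layer."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # 3 input channels (red, green, blue) -> 16 feature maps
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),            # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),            # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


model = SmallCNN()
# A fake batch of 4 RGB images, 64x64 pixels each (random values, not real cells).
batch = torch.rand(4, 3, 64, 64)
logits = model(batch)
print(logits.shape)  # one score per class for each image
```

Training would add a loss such as nn.CrossEntropyLoss and an optimizer loop over the real images; this sketch stops at the model definition and a forward pass.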
5. House Pricing
This is another practice competition on Kaggle. Unlike the previous datasets, this one is a regression task: you’ll want to predict an actual value for the sale price. You’ll have to log in to view the actual data.
Even before you log in, you’ll notice that they still list all of the columns and, boy, are there a lot of them! If you’ve been dissatisfied with the previous recommendations because of the well-behaved data (real-world data is seldom well-behaved) then look no further. Along with a bunch of columns that may or may not encode similar information, there are a lot of values that might take some time to interpret realistically.
This project is great for feature selection and gives you the opportunity to try some data cleaning and imputation. Visualization and Exploratory Data Analysis will also be key here, since there are way too many features to try to analyze without a visual aid. You can also explore the idea of correlation, and see whether including highly correlated features in your model helps or hurts.
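As a small taste of the cleaning and correlation steps, here’s a sketch that fills in missing values with each column’s median and then flags features that are almost copies of another feature. The table is randomly generated, and the column names and the 0.9 cutoff are assumptions for this example, not the competition’s actual columns or a recommended threshold.

```python
import numpy as np
import pandas as pd

# Synthetic housing table (NOT the real competition data).
rng = np.random.default_rng(7)
n = 300
living_area = rng.normal(1500, 400, size=n)
houses = pd.DataFrame({
    "living_area": living_area,
    # Built from living_area on purpose, so the two are highly correlated.
    "total_rooms": living_area / 250 + rng.normal(0, 0.3, size=n),
    "lot_size": rng.normal(9000, 2000, size=n),
    "year_built": rng.integers(1950, 2011, size=n).astype(float),
})
# Knock out some values so there is something to impute.
houses.loc[rng.choice(n, size=30, replace=False), "lot_size"] = np.nan

# Simple imputation: replace each missing value with that column's median.
houses_filled = houses.fillna(houses.median())

# Absolute correlations, keeping only the upper triangle so each pair appears once.
corr = houses_filled.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = houses_filled.drop(columns=to_drop)

print("Dropped as redundant:", to_drop)
print("Remaining columns:", list(reduced.columns))
```

Try training a model with and without the flagged columns to see for yourself whether removing them changes the results.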
6. MNIST Image Recognition
This is yet another practice competition on Kaggle. Like the Malaria project, this one is all image data. Unlike the Malaria project, the images are in black and white (grayscale). The competition’s creators, Kaggle themselves, recommend this project if you’ve got some experience in Python and basic modeling but are new to computer vision.
The MNIST dataset is one year shy of being considered an antique in the car industry (that’s 25 years old!), but it’s still the dataset we all recommend as a gentle introduction to image recognition. Since it’s vision, you’re not predicting a numeric value but classifying an image. Unlike the Malaria project, this is not a binary “yes or no” but a multi-class “which digit is this, from 0 to 9”. Thus, your goal is to use a classification algorithm to correctly learn which digit each handwritten image shows. You can experiment with any classification model, but this is a good place to try your hand at Neural Networks, either basic or Convolutional.
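Here’s a quick sketch of multi-class digit classification you can run without downloading anything. Note the assumption: it uses the small 8x8 digits dataset that ships with scikit-learn as a stand-in, not the actual MNIST images from the Kaggle competition, and logistic regression is just one simple choice of classifier.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 grayscale digits bundled with scikit-learn (a stand-in for MNIST).
digits = load_digits()
X = digits.data / 16.0   # pixel values range from 0 to 16; scale to 0-1
y = digits.target        # labels are the digits 0 through 9

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Logistic regression handles all ten classes directly.
clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
predicted = clf.predict(X_test[:5])
print(f"Held-out accuracy: {accuracy:.2f}")
print("First five predictions:", predicted, "actual:", y_test[:5])
```

Once this feels comfortable, swapping in the real MNIST images and a neural network is a natural next step.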
7. Spam Email Detection
Hate getting spam calls? Spam emails? Me too. This dataset provides you with email contents and whether it’s spam or not, which gives you the power to train a model to act as a spam filter.
Like some of the others, this project is also a classification task. It is flexible, allowing the use of more basic algorithms like logistic regression and naive Bayes or more complicated models like neural networks. There are different flavors of this one out there. The dataset I’ve linked has the most common words pulled out and counted for you, providing a lot of columns with all numerical data. Other versions provide you with the email contents, the literal text, and have you work with natural language processing techniques to figure out if the emails are spam or not. I personally prefer the text version, but depending on how comfortable you are with modeling and text processing, the word-count version might be a good place to start before moving to the text-only version.
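For the text flavor, a common starting point is to turn each email into word counts and feed those to a naive Bayes classifier. The sketch below assumes a handful of made-up messages and labels (1 for spam, 0 for not spam) purely for illustration; a real dataset would have thousands of emails.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented examples (NOT from any real spam dataset).
emails = [
    "win a free prize now",
    "free money click now",
    "claim your free prize today",
    "meeting moved to monday",
    "lunch with the team tomorrow",
    "project notes from the meeting",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# CountVectorizer builds the word-count columns; MultinomialNB learns from them.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

new_emails = ["free prize money", "team meeting on monday"]
predictions = spam_filter.predict(new_emails)

for text, label in zip(new_emails, predictions):
    print(f"{text!r} -> {'spam' if label == 1 else 'not spam'}")
```

The word-count columns that CountVectorizer produces are the same kind of numerical features that the pre-processed version of the dataset hands you directly.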
8. NBA Salaries
Ever wonder if there’s a formula out there used to indicate how much money the athletes in the NBA make? Now you can try and create one! If you’re into sports, this practice Kaggle competition is for you.
This one is a regression task, since you’ll be modeling how much a player makes given their stats and other information. While salary is the original target, you can also choose to model other things within the dataset and ask adjacent questions, which can range from regression to classification.
There are quite a few features to explore in this dataset, but thankfully the data is well-behaved. You’ll get a lot of practice in Exploratory Data Analysis and feature selection. This is a great introduction to sports analytics and modeling, which is a whole sub-field of data science and machine learning.
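To show what evaluating a regression model looks like, here’s a short sketch that fits a linear model and reports two common metrics. The player stats, the salary formula, and the variable names are all invented for this example; none of it comes from the actual NBA dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic player stats (NOT real NBA data).
rng = np.random.default_rng(3)
n = 250
points = rng.uniform(2, 30, size=n)     # points per game
minutes = rng.uniform(5, 38, size=n)    # minutes per game
age = rng.integers(19, 39, size=n)

# Made-up salary formula (in millions) plus noise, so there is a pattern to find.
salary = 0.8 * points + 0.2 * minutes + rng.normal(0, 2, size=n)

X = np.column_stack([points, minutes, age])
X_train, X_test, y_train, y_test = train_test_split(
    X, salary, test_size=0.2, random_state=3
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# MAE: average size of the error, in the same units as the target.
mae = mean_absolute_error(y_test, pred)
# R^2: share of the variation in salary that the model explains.
r2 = r2_score(y_test, pred)

print(f"MAE: {mae:.2f} million, R^2: {r2:.2f}")
```

The fitted coefficients in model.coef_ are effectively the “formula” the model found, which you can compare against your own intuition about what drives salary.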
As an add-on, you can also consider joining the Veritas AI programs to get 1-on-1 mentorship as you create your own machine learning models and projects! We have previously had students work on the projects mentioned above, and more. You can find a list of exciting projects here!
Whether you’re interested in healthcare or sports, finance or history, these Machine Learning and Data Science projects have you covered. While this is a great list to start with, there are thousands of beginner projects out there! It’s never too late to start learning something new. Happy modeling!