Veritas AI

10 Data Visualization Project Ideas for High School Students

Data visualization projects offer an exciting gateway into data science, allowing high school students to creatively explore real-world data and develop critical skills in coding, analysis, and storytelling. These projects go beyond mere graphs—they teach you how to turn raw data into a compelling narrative, whether you're examining environmental shifts, social media dynamics, or historical trends. Through these experiences, you can learn to present complex information in ways that are visually appealing and easy to understand.

In addition to enhancing your skills, data visualization projects could be a great way to make your college applications shine. By demonstrating your ability to solve real-world problems and your dedication to pursuing knowledge beyond the classroom, these projects stand out as impressive achievements. They highlight your critical thinking skills, your understanding of data, and your ability to communicate insights effectively.

Here are 10 diverse and engaging data visualization project ideas suitable for high school students, irrespective of their experience level. Whether you start with simple bar charts and line graphs or dive into more complex visuals like scatter plots and interactive dashboards, there's a project for everyone.


Project Idea 1: Stroke Risk Prediction in Patients 

Strokes are among the leading causes of death worldwide, and data science can help identify high-risk patients before they experience severe symptoms. In this project, you will build a machine-learning model that predicts stroke likelihood based on a variety of patient factors, such as age, lifestyle, and pre-existing conditions. Data science has huge potential in healthcare, and this is a good project if you are looking to explore that intersection.

To get started, you will be working with the Electronic Health Records (EHR) dataset, which is available on Kaggle and was compiled by McKinsey & Company. You will need to define key risk factors for stroke and suggest preventive measures that could reduce or delay the onset of a stroke. You will undertake Exploratory Data Analysis (EDA) to look for inconsistencies in the data, and once you have a consistent dataset, you can use scatter plots, pie charts, and bar charts to present relevant findings.

The final step of the project is building a predictive model. You will be required to use algorithms like random forests and decision trees to develop a predictive model for stroke risk using the processed dataset, which consists of an even distribution of stroke and no-stroke cases. This explanation on Kaggle will help you understand the basics of modeling data for stroke cases. 
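The modeling step can be sketched with scikit-learn's random forest. The column names below mirror the Kaggle stroke dataset, but the rows are synthetic stand-ins, so treat this as a template rather than the real analysis:

```python
# Hypothetical sketch: training a stroke-risk classifier on a balanced dataset.
# Column names follow the Kaggle EHR file, but the data here is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "hypertension": rng.integers(0, 2, n),
    "avg_glucose_level": rng.normal(105, 30, n),
    "bmi": rng.normal(27, 5, n),
})
# Synthetic label tied to age and hypertension so the model has signal to learn
df["stroke"] = ((df["age"] > 60) & (df["hypertension"] == 1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="stroke"), df["stroke"], test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

On the real dataset you would replace the synthetic frame with the loaded CSV and spend most of your time on the cleaning and feature-engineering steps described above.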

Level of knowledge required: Intermediate

Skills required: Machine learning basics, statistical modeling, and familiarity with Python libraries such as Pandas and Plotly

Coding background required: Advanced Python

Potential drawbacks: The dataset does not account for all stroke risk factors, such as ethnic background, which plays a role in non-communicable diseases, so results might not be accurate. The project also relies on feature engineering to identify risk factors, a time-consuming process that beginners with limited coding knowledge might find challenging.


Project Idea 2: Social Media Content Sentiment Analysis


Social media is a great source of data, and you can define the parameters to study social media posts relevant to your project. This type of project uses Natural Language Processing (NLP) techniques to classify the sentiment of social media posts as neutral, positive, or negative. This can be a powerful way to gauge public opinion or understand how sentiment fluctuates over time.

To begin, you’ll need to gather a dataset of social media posts. Fortunately, this step has already been done for you: a dataset of tweets is available on Kaggle, perfect for beginners looking to get started without the hassle of scraping data. While the dataset may be slightly outdated, it still serves as an excellent learning tool for understanding sentiment analysis.

The next step involves using NLP techniques to analyze the data. You will start by cleaning the data, which basically involves removing special characters and numbers from the text. This is followed by tokenization, the process of breaking a text down into smaller chunks called tokens. Once this is done, you will apply the Bag of Words (BoW) approach to represent the text numerically and use logistic regression to classify the data points.
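A minimal sketch of the cleaning, Bag-of-Words, and logistic-regression steps, using a handful of invented tweets in place of the Kaggle data:

```python
# Sketch of the sentiment pipeline: clean -> tokenize/BoW -> logistic regression.
# The tiny labeled "tweets" below are invented for illustration.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["I love this phone!!!", "Worst service ever :(", "Totally happy today",
          "I hate waiting 2 hours", "Great game last night", "This update is terrible"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Cleaning: drop everything except letters and spaces, then lowercase
cleaned = [re.sub(r"[^a-zA-Z ]", "", t).lower() for t in tweets]

# CountVectorizer tokenizes and builds the Bag-of-Words matrix in one step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)

clf = LogisticRegression()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["love this great phone"])))
```

With the real Kaggle dataset the pipeline is identical; only the loading step and the size of the vocabulary change.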

You can create visualizations to represent the findings in multiple ways. Word clouds are particularly effective at displaying frequently used words associated with positive or negative sentiments, offering a quick, visually striking representation of how people feel about a topic. Bar plots, on the other hand, provide a clear breakdown of sentiment distribution, showing whether a topic tends to generate positive, negative, or neutral reactions. This project is an excellent introduction to both NLP and data visualization, giving you the chance to explore how social media data can reveal meaningful insights about public opinion.


Level of knowledge required: Beginner

Skills Required: Basic Python. Note: here’s our guide on useful resources to learn Python. An understanding of Natural Language Processing will also be useful.
Coding background required: Intermediate Python

Potential drawbacks: There are some data privacy concerns depending on the dataset. The model might be ineffective at detecting double negatives, sarcasm, and sentiments from lengthy sentences. 


Project Idea 3: Analyzing Forest Fires to Predict Future Patterns

This is a beginner-level data analysis project in which you will be using R, a programming language useful for statistical computing and data visualization. You'll work with a dataset on forest fires in Portugal, provided by the UCI Machine Learning Repository, to find forest fire patterns. Using R and various data visualization techniques, you'll study factors such as temperature, humidity, and wind speed and their correlation with the spread of the fire. 

You’ll begin by importing the dataset and using the Tidyverse to clean and process the data. Then you will need to categorize data points for all the factors you want to include in the study.

For instance, you can create simple line graphs to show fire frequency based on the day of the week or the month. Box plots can further enhance your analysis by illustrating the relationship between temperature and fire intensity, and the heatmap function will help you create a representation in which viewers can see the correlation between the factors leading to forest fires.


Level of knowledge required: Beginner

Skills Required: Some experience with R is required and you will need to know the basics of variables, data types, and data structures in R
Coding background required: Intermediate R

Potential drawbacks: While the dataset offers valuable insights, it’s static, meaning it won’t account for real-time data or evolving conditions. To make accurate predictions in the future, you would need to regularly update the dataset with new information.


Project Idea 4: COVID-19 Infections Visualization

Although the COVID-19 pandemic is behind us, researchers have been using the experience to study and understand infectious diseases better. As there has been a lot of data generated related to the disease, studying it as a project can help you explore some of the basics of data visualization. 

You will need to use the dataset from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, which has a comprehensive collection of infection rates and death rates from all across the globe. While the data has not been updated since March 2023, it offers a comprehensive historical record of the pandemic’s impact across regions. 

While this project doesn’t involve predictive modeling, it provides an opportunity to practice descriptive data analysis and make sense of large, complex datasets. To begin, you’ll import the dataset into Python and clean it for analysis. With the vast amount of data available, you can choose to focus on specific regions, time frames, or trends. Visualizing the spread of infections and mortality rates through pie charts, line graphs, or bar plots can help you explore how the virus impacted different countries or how infection rates changed over time.

One particularly engaging visualization could involve tracking the spread of COVID-19 from its initial outbreak to the global pandemic stage. You might also create heatmaps to highlight the regions with the highest infection rates or use line graphs to illustrate how infection waves rose and fell. 
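The reshaping step that precedes any of these visualizations can be sketched in pandas. The mini table below imitates the JHU wide layout (one row per region, one column per date) with made-up numbers:

```python
# Sketch of reshaping a JHU-style wide time series into a tidy table.
# The real CSSE files have many regions and hundreds of date columns;
# this miniature version uses invented numbers.
import pandas as pd

wide = pd.DataFrame({
    "Country/Region": ["Italy", "Spain"],
    "1/22/20": [0, 0],
    "1/23/20": [3, 1],
    "1/24/20": [9, 4],
})

# Melt date columns into rows, then compute daily new cases per country
long = wide.melt(id_vars="Country/Region", var_name="date", value_name="cumulative")
long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")
long = long.sort_values(["Country/Region", "date"])
long["new_cases"] = long.groupby("Country/Region")["cumulative"].diff().fillna(long["cumulative"])

print(long)
# A per-country line graph is then one call away, e.g.:
# long.pivot(index="date", columns="Country/Region", values="cumulative").plot()
```

Once the table is tidy, slicing it down to a region or a time window for the charts described above is straightforward.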

Level of knowledge required: Intermediate

Skills Required: Some experience with Python and various libraries is required and you will need to know the basics of variables, data types, and data structures in Python
Coding background required: Intermediate Python

Potential drawbacks: As the dataset is no longer updated, this project focuses on historical analysis rather than real-time predictions, limiting its scope for future developments. 


Project Idea 5: Fandango Movie Ratings Analysis

In 2015, Fandango, the popular ticketing company that also rated movies out of five stars, found itself in controversy when FiveThirtyEight uncovered discrepancies in its rating system. FiveThirtyEight also shared the dataset with the public, which you can use for an interesting project comparing whether the rating system changed in 2016. Using R and statistical analysis, you'll study rating distributions, compile summary statistics, and visualize changes in rating patterns, if any.

You’ll start by loading the 2015 and 2016 Fandango movie ratings datasets. The next step would involve data preprocessing, which includes selecting samples for analysis. You will then use kernel density plots to study the distribution of movie reviews across 2015 and 2016. Finally, you will visualize changes in rating patterns using bar plots to compare the average ratings before and after the controversy, and summary statistics will offer further insight into whether the bias was corrected. 

This project offers a hands-on lesson in both the power and limitations of data analysis. You’ll not only learn how to clean and process real-world data but also how to use visualizations to tell a story. Through this analysis, you’ll explore the broader implications of how biases in data can influence public perception.

Level of knowledge required: Beginner

Skills Required: Some experience with R is required and you will need to know the basics of data manipulation, hypothesis testing, and statistical inference.
Coding background required: Intermediate R

Potential drawbacks: Since this project is based on historical data, it’s focused on analysis rather than predictive modeling.


Project Idea 6: Amazon's Top 50 Bestselling Books from 2009 to 2019 Analysis

This is a classic beginner’s project for students interested in exploring the basics of data analysis and visualization. You will learn about exploratory data analysis and data cleaning while also working with multiple Python libraries. You will use the dataset from Kaggle, which has a list of 550 books categorized into fiction and non-fiction.

The initial step is to ensure that you have imported all the necessary Python libraries, as this project will require working with pandas, numpy, fuzzywuzzy, matplotlib, seaborn, and plotly. The second step is data cleaning and preprocessing, as the same books are sold at different prices over the years and can appear as duplicates in the dataset. Once you have a clean dataset, you need to create new data frames based on the average rating of authors, the number of books written by each author, total reviews for books, and total books in each genre. This will form the basis of your visualization.
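The grouping step might look like this in pandas; the column names follow the Kaggle file's conventions, but the rows are invented:

```python
# Sketch of the "new data frames" step: deduplicate, then group by author/genre.
# Column names follow the Kaggle bestsellers file; the rows are invented.
import pandas as pd

books = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book C"],
    "Author": ["Jane Doe", "Jane Doe", "Jane Doe", "John Roe"],
    "User Rating": [4.5, 4.5, 4.8, 4.2],
    "Reviews": [1000, 1200, 800, 500],
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Fiction"],
    "Year": [2009, 2010, 2011, 2012],
})

# Drop repeat appearances of the same title before aggregating
unique_books = books.drop_duplicates(subset="Name")

author_stats = unique_books.groupby("Author").agg(
    avg_rating=("User Rating", "mean"),
    books_written=("Name", "nunique"),
    total_reviews=("Reviews", "sum"),
)
genre_counts = unique_books["Genre"].value_counts()
print(author_stats)
print(genre_counts)
```

These aggregated frames feed directly into the bar charts, scatter plots, and heatmaps described below.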

Bar charts can display the most popular genres, while scatter plots might reveal relationships between book length and user satisfaction. You could even create heatmaps to showcase which authors dominated the bestseller list during specific years. Once you have created charts to explore the correlation between user ratings, prices, and genres, you can draw conclusions such as which author had the highest ratings between 2009 and 2019. 

Level of knowledge required: Beginner

Skills Required: Some experience with Python and various libraries is required and you will need to know the basics of variables, data types, and data structures in Python
Coding background required: Intermediate Python

Potential drawbacks: Data cleaning can be time-consuming due to the presence of duplicates or inconsistencies in the dataset.


Project Idea 7: Anomaly Detection in Credit Card Transactions

This project, though beginner-friendly, involves the use of not just one but two different tools for a complete data visualization. In this project, you’ll build a machine-learning model that identifies fraudulent transactions by analyzing patterns in large datasets. First, you’ll use Python for data cleaning and exploratory data analysis (EDA), which helps in visualizing distributions, trends, and relationships between variables using seaborn and matplotlib.

A machine learning algorithm such as a decision tree will help you build a model that detects anomalous transactions. You can use this dataset from Kaggle for your project.


Building a model that correctly detects anomalous transactions is the key step in this project, and you will also need to work diligently to ensure that the data is processed properly. For visualization, you’ll use Power BI, a powerful data visualization tool from Microsoft that allows you to create detailed dashboards. You can generate interactive graphs showing which factors (like transaction size or location) are most indicative of fraud.
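The Python half of the project can be sketched with scikit-learn's decision tree. The synthetic data below is deliberately imbalanced to mimic the skew of the real Kaggle file, though the features and proportions are invented:

```python
# Sketch of the anomaly-detection model: a decision tree on imbalanced
# synthetic data (features and fraud rate are invented for illustration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)
n_normal, n_fraud = 950, 50  # roughly 5% fraud, far more than the real file
normal = rng.normal(loc=[50, 0], scale=[20, 1], size=(n_normal, 2))
fraud = rng.normal(loc=[400, 5], scale=[100, 1], size=(n_fraud, 2))
X = np.vstack([normal, fraud])            # features: amount, anomaly score
y = np.array([0] * n_normal + [1] * n_fraud)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)
# class_weight="balanced" counteracts the skewed class distribution
tree = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=0)
tree.fit(X_train, y_train)
print(classification_report(y_test, tree.predict(X_test)))
```

Exporting the per-transaction predictions to a CSV is a simple way to hand the results over to Power BI for the dashboard step.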

Level of knowledge required: Beginner

Skills Required: Some experience with Python and various libraries is required and you will need to know the basics of data processing and machine learning algorithms
Coding background required: Intermediate Python, some knowledge of Power BI is also required.

Potential drawbacks: The dataset is highly imbalanced, which can make training an accurate model challenging. You’ll need to experiment with different algorithms and techniques to improve accuracy.


Project Idea 8: Time Series Analysis of Currencies against US Dollar

Time series analysis is one of the best techniques to display data in a fun and interactive way. Using Python and Plotly, a library that offers over 30 different chart types, you can conduct a project comparing different currencies against the US Dollar. The dataset that you’ll be using for this project was compiled from Yahoo Finance data for 38 Asian currencies between 2004 and December 2022. It can be accessed on Kaggle here.

As is the case for most data visualization projects, the first step is to make sure you have all the necessary libraries installed; for this project, they’ll be numpy, pandas, matplotlib, seaborn, and plotly. You will need to clean and process the data to ensure there are no missing values and then create new dataframes for each currency, allowing you to visualize the fluctuations of individual currencies against the US dollar.
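Splitting the table into per-currency data frames might look like the sketch below; the column names are invented, since the exact Kaggle schema may differ:

```python
# Sketch of splitting a multi-currency table into one data frame per currency
# and locating each one's peak against the dollar. Column names are invented.
import pandas as pd

rates = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-01", "2022-01-02"]),
    "currency": ["JPY", "JPY", "INR", "INR"],
    "rate_vs_usd": [115.1, 116.3, 74.4, 74.1],
})

# One data frame per currency, keyed by currency code
per_currency = {code: grp.set_index("date") for code, grp in rates.groupby("currency")}

for code, df in per_currency.items():
    peak_date = df["rate_vs_usd"].idxmax()
    print(code, "peaked on", peak_date.date(), "at", df.loc[peak_date, "rate_vs_usd"])
```

Each per-currency frame can then be passed to Plotly's line or scatter functions for the interactive charts described below.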


A line graph is a simple yet effective way to display these fluctuations over time, helping you trace major changes in exchange rates. For example, you could track how the 2008 financial crisis or the 2020 COVID-19 pandemic affected currency values. Scatter plots can be used to pinpoint when the currency exchange rate peaked or hit rock bottom. 

Level of knowledge required: Intermediate

Skills Required: Some experience with Python and various libraries is required and you will need to know the basics of time series analysis and data processing
Coding background required: Intermediate Python

Potential drawbacks: Data processing is required separately for each currency, and creating a graph for each one can be repetitive and time-consuming.


Project Idea 9: Customer Churn Analysis

Customer churn is a key term for businesses, as it indicates the customer attrition rate for any particular business at a given time. This project has practical applications, and if you can complete it successfully, it will look good on your resume. It is slightly different from others on the list, as it does not involve the use of any programming language but rather two tools from Microsoft.

You will use Power Query, which handles the extract, transform, load (ETL) processing of data. Think of it as an advanced version of Microsoft Excel. You will use this dataset from GitHub, and Power Query will help you with data preparation, categorization, grouping, formatting, data transformation, and data modeling. You’ll categorize customers based on various factors such as subscription length, customer satisfaction, and interaction history. Then, you’ll visualize the data in Power BI, using tools like pie charts or bar graphs to display the distribution of customer churn across different categories.

To add depth to your analysis, you can use Data Analysis Expressions (DAX), which is a query language similar to SQL. DAX enables you to create more complex calculations and comparisons, providing deeper insights into why certain customers are leaving. 

Level of knowledge required: Intermediate

Skills Required: Some experience with Microsoft Excel and Power BI is required
Coding background required: None, though intermediate experience with a query language such as SQL will make DAX easier to pick up.

Potential drawbacks: If you’re not familiar with Power Query or DAX, there may be a learning curve as you adjust to the syntax and functionality. Additionally, if the dataset isn’t properly cleaned, your visualizations may not be accurate.


Project Idea 10: Iris Flower Classification

Iris flower classification is a machine learning project useful for beginners to develop an understanding of dataset processing and classification. In this project, you will use the Iris flower dataset, which was created by Ronald Fisher in the 1930s. The dataset details a few features of different varieties of Iris flowers, including the dimensions of sepals and petals (length and breadth, separately).

This project can also be used to create informative scatter plots and histograms that visualize the differences between the species. For example, a scatter plot showing petal length vs. petal width can reveal clear distinctions between the three species.

After exploring the data, you’ll use a machine-learning algorithm to build a model that classifies each flower species. This Kaggle notebook shows how four different models can be used for data classification and you can refer to it for guidance. You’ll train the model on a portion of the dataset, and then test its accuracy on the remaining data. Finally, you’ll visualize the classification results to see how well your model distinguishes between the species.
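Because scikit-learn ships a copy of Fisher's dataset, the whole train/test workflow fits in a few lines. The k-nearest-neighbors classifier here is just one of several models you could try:

```python
# The classic Iris workflow: load the built-in dataset, split it, train a
# classifier, and check accuracy on the held-out portion.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # 150 flowers, 4 measurements each
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Swapping KNeighborsClassifier for a decision tree or logistic regression is a one-line change, which makes this a good sandbox for comparing models.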

Level of Knowledge Needed: Beginner

Skills Required: Machine learning, Python programming, familiarity with neural networks, scikit-learn and graphs                           

Coding Background Requirements: Introductory Python and machine learning algorithms.              

Potential Drawbacks: Nothing in particular


If you’re looking to build a project/research paper in the field of AI & ML, consider applying to Veritas AI! 

Veritas AI is founded by Harvard graduate students. Through the programs, you get a chance to work 1:1 with mentors from universities like Harvard, Stanford, MIT, and more to create unique, personalized projects. In the past year, we had over 1000 students learn AI & ML with us. You can apply here!

