Nearly everything can be represented as data, which makes data an essential part of both computing and machine learning. How well a machine learning model performs depends heavily on the datasets it is given. But how do you determine which dataset is the best for your project? Here’s a list of the top 10 free and easily accessible online machine learning datasets.
This dataset contains the sleep times and weights of a set of mammals. It consists of 11 variables and can be used to understand mammalian sleeping patterns.
name: common name
genus: taxonomic rank
vore: carnivore, omnivore or herbivore
order: taxonomic rank
conservation: status of the mammal
sleep_total: total amount of sleep measured in hours
sleep_rem: REM sleep measured in hours
sleep_cycle: length of sleep cycle measured in hours
awake: time spent awake measured in hours
brainwt: brain weight in kilograms
bodywt: body weight in kilograms
This dataset consists of car seat sales at 400 different store locations, described by 11 variables. Several of the following variables, such as unit sales and income, are measured in thousands.
Sales: unit sales at each location
CompPrice: Price charged by competitor at each location
Income: Community income level measured in thousands of dollars
Advertising: Local advertising budget for the company at each location
Population: Population size in region
Price: Price the company charges for car seats at each site
ShelveLoc: Quality of the shelving location for the car seats at each location, rated Bad, Medium or Good
Age: Average age of the local population
Education: Education level at each location
Urban: Yes/No to indicate whether the store is in an urban or rural location
US: Yes/No to indicate if the store is in the US or not
This dataset contains information on almost 54,000 diamonds, described by ten variables.
Carat: weight of the diamond
Cut: quality of the diamond measured from Fair, Good, Very Good, Premium, Ideal
Color: color of the diamond measured from D, the best, to J, the worst
Clarity: how clear the diamond is, measured on the following scale (worst to best): I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF
Depth: total depth percentage, calculated using the x, y and z variables
Table: width of top of the diamond in relation to its widest point
Price: amount in USD
X: length in millimeters
Y: width in millimeters
Z: depth in millimeters
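The depth percentage can be reproduced from the three dimensions listed above. The following is a minimal sketch, assuming the commonly documented formula of z divided by the mean of x and y, expressed as a percentage. The function name and the sample measurements are illustrative and are not taken from the dataset.

```python
def depth_percentage(x_mm: float, y_mm: float, z_mm: float) -> float:
    """Return total depth percentage: z divided by the mean of x and y, times 100."""
    if x_mm + y_mm <= 0:
        raise ValueError("x and y must sum to a positive length")
    # z / mean(x, y) is the same as 2 * z / (x + y).
    return 100.0 * 2.0 * z_mm / (x_mm + y_mm)


# Made-up measurements in millimeters, used only to show the arithmetic.
print(depth_percentage(4.0, 4.0, 2.5))  # prints: 62.5
```

Recomputing the value this way is a quick sanity check for rows where x, y or z may have been recorded incorrectly.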
This dataset allows you to contribute your own recordings of spoken digits, as long as they are 8 kHz WAV files and in English. The recordings are trimmed so that there is minimal silence at the beginning and end. As an open dataset, it is expected to grow over time as contributions trickle in. The dataset aims to help solve digit pronunciation problems and, at the time of this post, consists of 3,000 recordings from six speakers (50 of each digit per speaker).
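Before contributing a recording, it may help to confirm that the file meets the 8 kHz requirement. The sketch below uses Python's standard wave module. The function names are hypothetical, and the check covers only the sample rate, not the language or the trimming.

```python
import wave

REQUIRED_SAMPLE_RATE_HZ = 8000  # the 8 kHz requirement stated above


def has_required_sample_rate(path: str) -> bool:
    """Return True if the WAV file at `path` is sampled at 8 kHz."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == REQUIRED_SAMPLE_RATE_HZ


def expected_recordings(speakers: int, digits: int = 10, takes_per_digit: int = 50) -> int:
    """Count recordings if every speaker records every digit the same number of times."""
    return speakers * digits * takes_per_digit


# 6 speakers x 10 digits x 50 takes each
print(expected_recordings(6))  # prints: 3000
```

The second helper simply restates the arithmetic behind the recording count, so the total can be rechecked as more speakers are added.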
Wikipedia is not only a resource for students writing research papers, but also a very useful tool for Natural Language Processing researchers. This dataset consists of nearly 1.9 billion words from more than 4 million Wikipedia articles and can be searched by word, phrase, or paragraph.
The subjects of this dataset are mostly male and female adults between the ages of 18 and 20, from various ethnicities. The objective of this dataset is to help distinguish not only between genders but also between emotions. As part of the dataset, images with a resolution of 180x200 pixels were taken of the male and female subjects. In total, nearly 400 individuals participated, with 20 images taken of each subject. Anyone can download this dataset as a zip file.
Ham or spam? This dataset helps predict whether a text message is ham (legitimate) or spam. Consisting of more than 5,500 messages in English, this dataset is beginner-friendly and simple to comprehend. It uses a comma-separated values format with one message per line, made up of two columns: v1, the label (ham or spam), and v2, the raw text.
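The two-column layout described above can be read with Python's standard csv module. This is a minimal sketch: the two sample rows are made up for illustration, and a real download may need a specific file encoding or contain extra columns, which is not handled here.

```python
import csv
import io
from collections import Counter

# Two made-up rows in the layout described above: v1 is the label, v2 is the raw text.
SAMPLE_CSV = (
    "v1,v2\n"
    'ham,"See you at lunch, ok?"\n'
    'spam,"You have won a prize! Reply WIN to claim."\n'
)


def load_messages(csv_text: str) -> list[tuple[str, str]]:
    """Parse CSV text into (label, message) pairs using the v1 and v2 columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["v1"], row["v2"]) for row in reader]


messages = load_messages(SAMPLE_CSV)
label_counts = Counter(label for label, _ in messages)
print(label_counts["ham"], label_counts["spam"])  # prints: 1 1
```

Counting the labels first is a useful habit, since the balance between ham and spam affects how a classifier's accuracy should be interpreted.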
Like the Spam SMS Classifier dataset, this dataset is beginner-friendly and useful for understanding deep learning techniques and pattern recognition on real-world data. With 70,000 grayscale images of 28x28 pixels, this set was created to replace the original MNIST dataset as the new benchmark for algorithms. In this dataset, each pixel has an integer pixel value from 0 to 255 associated with it, with larger numbers representing darker pixels.
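A common first step with pixel data like this is to reshape each image into a grid and scale the values to the range 0 to 1. The sketch below assumes each image arrives as a flat list of 784 integers, which is how CSV distributions of such data are often laid out. The function name and the sample image are hypothetical.

```python
IMAGE_SIDE = 28        # images are 28x28 pixels
MAX_PIXEL_VALUE = 255  # pixel values run from 0 to 255


def to_scaled_grid(flat_pixels: list[int]) -> list[list[float]]:
    """Reshape 784 pixel values into 28 rows of 28 and scale each to [0.0, 1.0]."""
    if len(flat_pixels) != IMAGE_SIDE * IMAGE_SIDE:
        raise ValueError("expected 784 pixel values")
    if any(not 0 <= p <= MAX_PIXEL_VALUE for p in flat_pixels):
        raise ValueError("pixel values must be between 0 and 255")
    return [
        [p / MAX_PIXEL_VALUE for p in flat_pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE]]
        for r in range(IMAGE_SIDE)
    ]


# A made-up image: all zeros except the top-left pixel, which has the maximum value.
flat = [255] + [0] * 783
grid = to_scaled_grid(flat)
print(len(grid), len(grid[0]), grid[0][0])  # prints: 28 28 1.0
```

Scaling to a small fixed range is a design choice rather than a requirement, but many learning algorithms train more reliably on inputs of similar magnitude.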
Often used for classification problems in machine learning, this dataset describes the characteristics of the cell nuclei present in an image, including the following real-valued features:
Texture (standard deviation of gray-scale values)
Compactness (perimeter^2 / area - 1.0)
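The two features listed above are simple to compute once a nucleus has been measured. The following is a sketch under stated assumptions: the function names are illustrative, the population standard deviation is used because the text does not say which variant applies, and the inputs are made-up values rather than dataset rows.

```python
import math
import statistics


def texture(gray_values: list[float]) -> float:
    """Standard deviation of the gray-scale values (population form, an assumption)."""
    return statistics.pstdev(gray_values)


def compactness(perimeter: float, area: float) -> float:
    """Compute perimeter^2 / area - 1.0, as given in the feature list."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


# A perfect circle of radius 1 gives 4*pi - 1, whatever the radius.
print(round(compactness(2 * math.pi, math.pi), 3))  # prints: 11.566
print(texture([10, 10, 10]))  # prints: 0.0
```

Because a circle has the smallest perimeter for a given area, more irregular outlines produce larger compactness values than the circle's.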
Used by R.A. Fisher, a statistical science genius, in 1936, this dataset can still be used to build simple machine learning projects and is beginner-friendly. The dataset is small and consists of four attributes, all measured in centimeters (sepal length, sepal width, petal length and petal width), and three classes: Virginica, Setosa and Versicolor.
Creating datasets for machine learning is a laborious human task, but luckily there are several public datasets available. The datasets mentioned above are user-friendly, but rest assured there are plenty of other accessible datasets available for use, regardless of your project or use case.