20+ Amazing (And Free) Data Sources Anyone Can Use To Build AIs

Bernard Marr

📖 Internationally Best-selling #Author🎤 #KeynoteSpeaker🤖 #Futurist💻 #Business, #Tech & #Strategy Advisor

Published Jun 2, 2023

Thank you for reading my latest article 20+ Amazing (And Free) Data Sources Anyone Can Use To Build AIs. Here at LinkedIn and at Forbes I regularly write about management and technology trends.

---------------------------------------------------------------------------------------------------------------

When we talk about artificial intelligence (AI) in business and society today, what we really mean is machine learning (ML). This refers to applications that use algorithms (sets of instructions) to become increasingly good at performing a particular task as they are exposed to more and more data relating to that task.

These tasks could be anything from answering questions and creating text or images (as demonstrated by apps like ChatGPT or Dall-E) to recognizing images (computer vision) or navigating autonomous vehicles from A to B.

All of these tasks require data, and businesses that want to train their own ML algorithms in order to automate their day-to-day tasks need sources of data.

What types of data are there?

Business data is commonly divided into two categories – internal and external data.

Internal data is data collected by organizations themselves from within their own operations. This commonly includes financial data, customer feedback data, HR data, operational data, and many more sources. Data collected by an organization monitoring its own operations is said to be proprietary data, and is valuable because it gives information specific to that business.

External data comes from sources outside of the organization and is typically collected from third-party data sources such as those listed below. If data is freely available to anyone, it is called open data.

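Further to this, data can also be classified as structured, unstructured, or semi-structured. Semi-structured data, such as JSON or XML files, carries some organizing keys or tags but does not fit neatly into fixed tables.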

Structured data is information that fits neatly into tables. Sales data showing what products a business sold, when, where, and at what price is an example of internal, structured data. Alternatively, a business might analyze historical market data and economic indicators to predict future movements in the markets it operates in (structured, external data).

Unstructured data is everything else – for example, pictures, videos, text, and social media posts. It can certainly contain valuable insights but is more difficult to analyze. AI, however, has proven particularly useful for extracting meaning from unstructured data. Image recognition algorithms, for example, might tell a business useful facts about customer behavior by analyzing in-store CCTV images (internal, unstructured data). They might also find valuable insights by analyzing images related to the business posted on social media (unstructured, external data).
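To make these categories concrete, here is a minimal Python sketch using only the standard library. The sales records, review records, social media post, and function names are all hypothetical, invented purely for illustration; the point is simply how much more work it takes to get an answer as the data becomes less structured.

```python
import csv
import io
import json
from collections import Counter

# Structured: fixed columns, every row has the same fields (hypothetical sales records).
SALES_CSV = """product,date,store,price
widget,2023-05-01,London,9.99
gadget,2023-05-01,Leeds,24.50
widget,2023-05-02,Leeds,9.99
"""

# Semi-structured: keyed fields, but records need not share the same shape.
REVIEWS_JSON = """[
  {"product": "widget", "rating": 5, "tags": ["sturdy", "cheap"]},
  {"product": "gadget", "rating": 2}
]"""

# Unstructured: free text with no predefined fields (hypothetical social media post).
POST_TEXT = "Love my new widget! The widget arrived fast, but the gadget broke."


def revenue_by_product(csv_text: str) -> dict[str, float]:
    """Sum prices per product; straightforward because the columns are known."""
    totals: dict[str, float] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["product"]] = totals.get(row["product"], 0.0) + float(row["price"])
    return totals


def tags_per_review(json_text: str) -> list[int]:
    """Count tags per review, tolerating records that omit the 'tags' key."""
    return [len(review.get("tags", [])) for review in json.loads(json_text)]


def product_mentions(text: str, products: list[str]) -> Counter:
    """Crude keyword count; real unstructured analysis would need NLP or ML models."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w in products)
```

Calling `revenue_by_product(SALES_CSV)` works directly off the column names, `tags_per_review(REVIEWS_JSON)` has to allow for missing keys, and `product_mentions(POST_TEXT, ["widget", "gadget"])` can only guess at meaning from keywords, which is exactly the gap that machine learning aims to close.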

Luckily, data is everywhere. Whatever you’re trying to do, if it requires external data, there’s likely to be a source for it online. Governments, research institutions, private companies, and non-governmental organizations all routinely make data freely available for research and even commercial purposes. So here are some of the best sources of free online data available in 2023.

Data Search Engines and Repositories

Google Dataset Search – This is actually a search engine for datasets cataloged by Google; use this to find data on just about anything you could need.

AWS Open Data Search – Another dataset search engine, this one provided by Amazon Web Services (AWS).

Microsoft Research Open Data – Free, open datasets collected by Microsoft, with a mainly scientific focus.

UCI Machine Learning Repository – A repository of more than 600 open datasets curated and maintained by the University of California, Irvine, and made available for the purpose of training machine learning algorithms.

Kaggle Datasets – Online data science platform Kaggle also offers a curated catalog of datasets covering everything from university rankings to trending Google searches, retail sales, online movie reviews, and crime statistics.

Reddit r/datasets – A vast collection of datasets submitted by users of the online community site Reddit, covering hundreds of subjects.

Government and Inter-Governmental Organization Datasets

Data.Gov – Open data portal provided by the US government, hosting nearly a quarter of a million datasets published by all government agencies.

Data.Census.Gov – If you’re specifically looking for US demographic data, this is a good place to start!

Data.EU – The European Union's open data portal contains data from EU institutions and member state governments.

Data.gov.uk – Open data sets published by UK government agencies.

World Health Organization Data – Datasets related to global health and wellbeing.

World Bank Open Data – Datasets related to economic development, international financial markets, social indicators, and environmental issues.

Image Data

Google Open Images – Millions of images classified and labeled in various ways, suitable for training many different types of computer vision algorithms.

ImageNet Open Dataset – Another dataset consisting of labeled images that’s free to use for non-commercial machine learning applications.

COCO Dataset – Common Objects in Context (COCO) is a dataset consisting of over 200,000 images selected for training object detection and captioning algorithms.

Sound Data

Mozilla Common Voice – An open dataset of voice recordings that can be used to train any AI application that involves speech.

Audioset – Another Google-curated dataset, this one focusing on sounds and containing hundreds of thousands of 10-second samples broken down into categories such as musical instruments, vehicles, and vocals.

Million Song Dataset – Audio features and metadata for one million contemporary popular music tracks.

Text Data

Wikidata – A free, structured knowledge base from the Wikimedia Foundation, the organization behind Wikipedia, with database downloads available in a number of different formats.

Common Crawl – An open repository of data scraped from the world wide web, famously used to train the GPT large language models powering ChatGPT and many other chatbots.

Other and Miscellaneous Datasets

Amazon Reviews – A database of around 35 million reviews for Amazon products, including product information and ratings.

Waymo Open Dataset – Alphabet’s autonomous driving subsidiary Waymo makes a huge amount of data collected via self-driving vehicles publicly accessible, including sensor data from cameras and LiDAR.

Apolloscape Dataset – More autonomous driving data, this time provided by Baidu’s open-source Apollo platform.

---------------------------------------------------------------------------------------------------------------

About Bernard Marr

Bernard Marr is a world-renowned futurist, influencer and thought leader in the fields of business and technology, with a passion for using technology for the good of humanity. He is a best-selling author of 21 books, writes a regular column for Forbes and advises and coaches many of the world’s best-known organisations. He has over 2 million social media followers, 1.7 million newsletter subscribers and was ranked by LinkedIn as one of the top 5 business influencers in the world and the No 1 influencer in the UK.

Bernard’s latest books are ‘Business Trends in Practice: The 25+ Trends That Are Redefining Organisations’ and ‘Future Skills: The 20 Skills and Competencies Everyone Needs To Succeed In A Digital World’.
