Finding and Evaluating Data: Data Basics

Data Basics

What is data?

Data consists of discrete values or units of information that can take many forms: numbers, words, characters, images, sound recordings, and videos, among others. Data is anything that can be collected, stored, organized, and analyzed.

A data set is information that someone has collected, assembled, and organized for analysis of an issue, phenomenon, or subject. It may contain many kinds of information (text, numbers, images, sound, video, code, and geospatial data) in a variety of formats (CSV, XML, TIFF, PDF, etc.).

"Free" datasets and software may have limits on how you can use them.

Here are some of the basic types of licenses for datasets:

  • Public Domain: no intellectual property rights apply, no attribution required
  • Creative Commons (CC): the most common type of license; CC offers a variety of licenses that grant different levels of permission
  • Open Data Commons (ODC): a set of licenses that allow users to share, modify, and use datasets, generally with proper attribution
  • Community Data License Agreement (CDLA): collaborative licenses that enable individuals and organizations to access, share, and use data openly

"Open source" refers to the software involved: open source software (OSS) is freely available online for download and use. The term describes the software itself and does not name any one license; OSS is released under a range of licenses of its own.

When you create a data set, you want to make sure that it is "good" data: accurate, complete, organized, and, ultimately, reusable.

Here are some tools that can help:

  • OpenRefine is a powerful free, open source tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
  • Awesome Dataset Tools on GitHub: provides links to a variety of tools you can use to label your data.
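
For small CSV files, a few basic checks can also be scripted by hand. The following is a minimal sketch in Python, not a replacement for the tools above: it flags rows with blank or missing values (a completeness check) and rows that exactly repeat an earlier row. The sample data, the column names, and the function name check_dataset are all hypothetical and used only for illustration.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
SAMPLE_CSV = """site,date,temperature_c
A,2024-01-01,3.5
B,2024-01-01,
A,2024-01-02,4.1
A,2024-01-02,4.1
"""


def check_dataset(csv_text):
    """Report incomplete rows and exact duplicate rows in CSV text.

    Row numbers are 1-based and do not count the header line.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    incomplete = []  # rows with a blank value or the wrong number of fields
    duplicates = []  # rows that exactly repeat an earlier row
    seen = set()
    for number, row in enumerate(reader, start=1):
        values = tuple(value.strip() for value in row)
        if len(values) != len(header) or "" in values:
            incomplete.append(number)
        if values in seen:
            duplicates.append(number)
        seen.add(values)
    return {"incomplete": incomplete, "duplicates": duplicates}


report = check_dataset(SAMPLE_CSV)
print(report)
```

A check like this only covers completeness and duplication. Whether the values are accurate still has to be judged against the original source of the data.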