
Dealing with Big Data

[Image: a server network]

Because we’re creating 3D visualizations from large datasets, our project raises a larger question of data management for projects like ours. We (the world at large, and our project in particular) are moving into the era of “Big Data”. As you may know, NSF now requires a data management plan with all proposals (ours was submitted before this requirement took effect). At the time we submitted the proposal, I would have said, truthfully, that the data we use come from a variety of sources that follow standard archiving and preservation practices, and that we would not duplicate those efforts.

As we’ve proceeded and the scope of the project has grown, we are producing higher-order datasets compiled from numerous sources, and eventually “data products” (the visualizations). Our project is investing considerable effort in these datasets and products as we develop and carry out workflows for identifying, transferring, extracting, merging, calibrating, and quality-controlling the topography, image, and related data. Because these data are collected and provided by many different agencies and organizations, with no uniform standard, it’s difficult to generalize the process even from one location to the next. However, we are learning a great deal about the kinds of issues that arise and will arise in the future.
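One practical step, whatever tools we settle on, is to record the same minimal provenance information for every incoming dataset, however it arrived. Below is a rough sketch (in Python, purely for illustration; the field names and values are assumptions, not an actual Lakeviz schema) of the kind of record that could travel with each source through the extract, merge, calibrate, and quality-control steps described above.

```python
# A hypothetical provenance record for one source dataset. Field names and
# values are illustrative only, not an actual Lakeviz schema; the point is
# that every source gets the same minimal metadata, whoever provided it.
source_record = {
    "provider": "example mapping agency",        # assumed provider name
    "dataset": "example_lake_topography.tif",    # assumed filename
    "data_type": "topography",                   # topography | imagery | other
    "retrieved": "2012-05-01",
    "original_format": "GeoTIFF",
    "coordinate_system": "UTM zone 10N, NAD83",  # assumed projection/datum
    "steps_applied": ["extracted", "reprojected", "merged", "quality-checked"],
    "notes": "record any calibration or gap-filling applied here",
}
```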

We brought this topic up at the KeckCAVES meeting on Friday, and there was general agreement that we should take a broad approach to data management for Lakeviz and related projects, likely by establishing a data server for the project, as a team member suggested. I also suggest that we investigate whether it’s worth adopting the computer science practice known as version control. Our technically astute audience is likely familiar with this concept for software: you can easily find out that you have Microsoft Word version x.yy.z on your computer, for example. The same idea can be applied to other documents, including a Lakeviz dataset. However, we might not want to use version control software directly on a large binary dataset like the Lakeviz compilations: we don’t want to archive a complete copy of these large datasets every time we make a small change. More likely, we’ll assign a version number to a dataset when we make a major change, and we’ll keep looking out for good data management practices to apply to our project.
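To make the “version number at each major change” idea concrete, here is a minimal sketch, assuming a Python environment, of what it might look like: rather than committing the large binary compilation itself, we would record its version, date, and a checksum in a small manifest file, and that manifest is what would live under ordinary version control. The file names and function names here are hypothetical.

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks, so a multi-gigabyte dataset never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tag_release(dataset_path, version, notes, manifest_path="lakeviz_versions.json"):
    """Append a version entry for a dataset to a small JSON manifest.

    Only the manifest goes under version control; the dataset itself is
    stored once on the data server, so a release adds a few hundred bytes,
    not another full copy of the compilation.
    """
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else []
    manifest.append({
        "dataset": str(dataset_path),
        "version": version,
        "date": date.today().isoformat(),
        "sha256": sha256_of(dataset_path),
        "notes": notes,
    })
    manifest_file.write_text(json.dumps(manifest, indent=2))

# Hypothetical usage:
# tag_release("tahoe_bathymetry_merged.tif", "1.0", "first merged compilation")
```

The appeal of this kind of scheme is that a checksum plus a version number lets anyone verify exactly which compilation a given visualization was built from, without ever storing duplicate copies of the dataset itself.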

- Post adapted from Louise Kellogg, PhD, KeckCAVES Director
