Tech Week: Online Databases and Data Sharing
This post is part of the May 2012 Technology Week, a quarterly topical discussion about technology and historical archaeology, presented by the SHA Technology Committee. This week’s topic examines the use and application of digital data in historical archaeology. Visit this link to view the other posts.
At the Center for Digital Antiquity (Digital Antiquity), we are committed to improving access to archaeological information and to supporting its preservation and use. Over the past four years, we’ve built tDAR (The Digital Archaeological Record), a digital repository designed to preserve the digital documents, data sets, images, and other digital results of archaeological investigations and excavations. tDAR is one of a number of discipline-specific repositories designed from the ground up to better support the needs of the content by providing rich, archaeology-specific metadata along with tools to discover, access, and use the uploaded materials.
Looking into the crystal ball, there are a number of significant challenges and important opportunities ahead:
If there’s anything that we can learn from the basic practice of archaeology, it’s that things do not get preserved unless the environment is right to enable preservation. This works best if there are multiple sources and tools available. In the case of archaeological data, it means that there is a mixture of sustainable technology, organizations, and tools to enable and facilitate preservation.
A digital repository that has the ambition of providing long-term preservation for archaeological data must be sustainable for the long term. There must be a realistic plan for funding the variety of activities required in order to ensure access and preservation of information, as well as succession plans. These are core components of being certified as a “Trusted Digital Repository,” something that Digital Antiquity aspires to make tDAR in the near future.
At Digital Antiquity, we have a plan and a schedule for achieving it. We see the development of a digital curation service useful for public agencies, research organizations, and individual researchers as key to sustaining the tDAR repository. We plan to charge for the deposit of information into tDAR to support the archiving of those materials, and are negotiating with other archives to serve as backup repositories for tDAR. The main point here is that any organization that is serious about providing long-term support for the data it maintains must have a plan to ensure financial support and must work diligently to execute this plan.
Digital Antiquity cannot solve this problem alone, however. Sustainability requires multiple sources, technologies, and approaches: tools like LOCKSS and organizations like the Internet Archive or HathiTrust all help ensure sustainable archaeological information. Sustainability also requires a change in culture. It requires that public agencies, research organizations, and individual researchers who create data ensure that it is available and remains preserved for future access and use, and that they budget funds as part of their activities to support the digital repositories.
One of the easiest ways to understand the challenges of the future is to look at the problems we’re still struggling with from the past. During the 1970s, 1980s, and 1990s, tremendous quantities of archaeological data were produced in the form of reports, documents, data sets, and other materials. Most of the data collected in the US was funded by public undertakings conducted through cultural resource management (CRM) investigations.
The challenge is that much, perhaps most, of this information is on the verge of being forgotten about and lost. Almost all of the reports from the CRM era are available only as paper records. Unless systematic efforts to preserve, digitize, and make more widely available these older reports and data are undertaken, this body of work will be forgotten or essentially lost.
Recently produced archaeological reports and other data often are in digital formats. However, if these reside only on a floppy disk, they too are one step away from being lost. The digital analog to the situation with paper records is not much better: one broken hard drive or corrupted Dropbox account, and the critical data is lost. When data is kept at the “personal” level without appropriate documentation and backup, it’s at risk.
With the advent of the web, some documents and databases have moved to the web as simple webpages or more complex websites. Moving to the web has been a major step forward, enhancing discovery and providing easier access. Tools like Google may enable these materials to be discovered and used, but not all databases are “discoverable.” For example, the NADB database has been hosted for a number of years by the Center for Advanced Spatial Technology (CAST) at the University of Arkansas. In this form, it was available online, but potential users had to know both about NADB and how to reach the NADB web page in order to perform a search. Simply putting something on the web does not equate with accessibility.
From an archival standpoint, a database like NADB in its current form would not be preserved either. Services like the Internet Archive attempt to archive sites, but they can capture only pages that can be linked to, and many databases are accessible only via search forms. Furthermore, even if they are accessible, the data is being preserved in a translated form, which is definitely better than not preserving the data at all, but not ideal.
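To see why a link-following archiver misses database content, consider a minimal sketch of a crawl over a toy site. Everything here is hypothetical: the page names, the site layout, and the `crawl` function are illustrations of the general idea, not the Internet Archive's actual crawler or the real NADB site.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to.
# The record pages exist, but no page links to them. They are only
# returned by the search form, which a link-following crawler
# cannot fill in and submit.
SITE_LINKS = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],          # a form with no outgoing links
    "/record/1": ["/"],
    "/record/2": ["/"],
}


def crawl(start, links):
    """Return the set of pages reachable by following links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


archived = crawl("/", SITE_LINKS)
missed = sorted(set(SITE_LINKS) - archived)
print("archived:", sorted(archived))
print("missed:", missed)
```

In this sketch the crawler reaches the home page, the about page, and the search form itself, but never the two record pages, which is the situation described above for databases that sit behind a search form.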
The other challenge can be boiled down to a fundamental question: what will happen to the website in 20 years? Sites like GeoCities or Ma.gnolia are examples of what can happen to data on the web without stewardship. Software reaches end-of-life comparatively quickly (5 years in some cases), after which the backend software or hardware is no longer supported. Tools like ColdFusion, early versions of Oracle, and older file formats such as WordPerfect are becoming scarcer and harder to use or access. Over the next 10-20 years, these challenges will grow as computing continues to evolve. The growth of cloud computing has great potential: tools like Google Docs and online databases provide a myriad of features we could only have dreamed of in the past. They also pose new challenges for preservation and use, because the data may be dependent on the tool, and the tool may restrict access for preservation or use. These too will have time and costs involved and will require online migration and future support.
Regarding use, within the United States there are federal and state regulations that prohibit the general availability of some kinds of archaeological information, specifically detailed site location information. This protection is critical to the management and preservation of the physical site. It requires, however, that online tools be sensitive to this information and that repositories develop methods for screening access and dealing with this kind of information.
There are two aspects to consider. First, most information about archaeological resources need not be held as confidential. In our experience, documents of several hundred pages may have only a few pages with specific site location information, and many reports do not contain any of this kind of detailed information. The challenge is to ensure that the goal of site protection does not endanger the overall ability to preserve and provide access. tDAR addresses this by enabling documents to be marked as confidential (or enabling redaction): the site location information is preserved and remains discoverable, but access to it is restricted.
The second aspect of this issue is how to ensure that those individuals and officials who need access to confidential information can get it. Managing the identity of repository users will require that, over time, tools be created to help manage identity and vet users, so that repositories can move away from each system managing separate credentials or requiring the initial uploader to validate all users.
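The access model described here, where a confidential document stays discoverable while its contents are limited to vetted users, can be sketched in a few lines. This is an illustration only, not tDAR's implementation: the `Document` class, its fields, and both functions are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical repository record; not tDAR's actual data model."""
    title: str
    confidential: bool = False
    authorized_users: set = field(default_factory=set)


def can_discover(doc, user):
    # Assumption: descriptive metadata is searchable by everyone, so
    # even a confidential report can be found and requested.
    return True


def can_download(doc, user):
    # Open documents are available to all; confidential ones only to
    # users who have been vetted and added to the authorized set.
    return (not doc.confidential) or (user in doc.authorized_users)


survey = Document("Phase I Survey Report", confidential=True,
                  authorized_users={"agency_officer"})
overview = Document("Regional Overview")

print(can_discover(survey, "student"))         # True
print(can_download(survey, "student"))         # False
print(can_download(survey, "agency_officer"))  # True
print(can_download(overview, "student"))       # True
```

Keeping discovery and download as separate checks is the point of the design: restricting the file does not have to hide the fact that the report exists.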
With the advent of the web, real-time, large-scale collaboration has become feasible, and in many cases quite productive. It requires a shared knowledge base and shared interest between the parties, as well as trust. Examples of collaboration range from NSF projects that span a country to the development of the state site files. But for these collaborations to work, significant synthesis work must be accomplished first: agreeing upon terms, definitions, archaeological and data standards, and so on. Within the world of archaeology, this is problematic. There are definitely some categories of classification that can be agreed upon, from faunal characteristics to scientific measurements, but many qualitative classifications do not have formal, agreed-upon meanings. Furthermore, significant work must be done once data has been collected in order to prepare it for collaborative endeavors. And for any of this to happen, there must be more data sharing and publication through tools like tDAR or Open Context.
The technology visionary dreams of the Semantic Web and linked data: a world where data is infinitely accessible and any query can be answered with a quick search and a click of the mouse, and where data can be collated from multiple sources automatically to answer questions that were impossible otherwise. The dream of the Semantic Web is one where data is “free” of the database, where there are no silos, and where data is interconnected in ways that the original creator could never conceive. The theory is that if online databases of various types were linked together and made available to users, they would enable complex, advanced searches that connect the multiple databases in new and unique ways.
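As a rough illustration of what “linking” means in practice, the sketch below combines two small, made-up data sources that happen to use the same site identifiers, then asks a question neither source could answer alone. The identifiers, predicates, and helper functions are all invented for this example; real linked data would use shared URIs and a query language such as SPARQL rather than Python lists.

```python
# Each statement is a (subject, predicate, object) triple.
# Both sources are hypothetical and exist only for this example.
SITE_FILE = [
    ("site:001", "hasPeriod", "period:historic"),
    ("site:002", "hasPeriod", "period:prehistoric"),
]
FAUNAL_DB = [
    ("site:001", "hasTaxon", "taxon:sus_scrofa"),
    ("site:002", "hasTaxon", "taxon:odocoileus"),
    ("site:001", "hasTaxon", "taxon:bos_taurus"),
]


def subjects(triples, predicate, obj):
    """All subjects that have the given predicate and object."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(triples, subject, predicate):
    """All objects attached to the subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Because both sources share identifiers, merging is just concatenation.
merged = SITE_FILE + FAUNAL_DB

# Question spanning both sources: which taxa occur at historic sites?
historic_sites = subjects(merged, "hasPeriod", "period:historic")
taxa = sorted(t for s in historic_sites
              for t in objects(merged, s, "hasTaxon"))
print(taxa)  # ['taxon:bos_taurus', 'taxon:sus_scrofa']
```

The merge is trivial here only because the two sources already agree on identifiers and terms, which is exactly the synthesis work that is hard to achieve in archaeology.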
The challenges of this, however, are great, ranging from data quality, to knowledge of external tools, to technical skill. The last is, in some ways, the greatest challenge. Archaeologists are in general a smart bunch, and often quite technically savvy, but these tools also have a high barrier to entry. Some of these barriers include:
In summary, none of these challenges is insurmountable. We have organizations dedicated to the preservation and use of digital data, and we have tools that are evolving to make it easier to ask and answer questions that we could only dream of in the past, linking data together and making new connections.
What we must work together to do is to continue to change the culture of archaeology to ensure that both legacy and new data are properly archived and preserved. And the challenge for technologists is to build tools that empower non-programmers to analyze and re-use data in new ways.