Access to Geographic Scientific and Technical Data in an Academic Setting

Bastiaan van Loenen
PhD candidate

Delft University of Technology
Delft, the Netherlands
b.vanloenen@geo.tudelft.nl

The paper is based on a Master of Science thesis carried out at the University of Maine, Department of Spatial Information Science and Engineering (Van Loenen, 2001).


Abstract
Data availability is a key issue affecting society's social well-being. Economic and legal scholars have argued that the current, relatively open, access to data environment in the United States is beneficial to advancing knowledge and the economy. However, we have little empirical evidence validating the extent to which various access policy environments do or do not contribute to the satisfaction of academic researchers or to the accomplishment of their project goals. Our research sought evidence for or against these broad conventions in the context of access to and use of geographic data for knowledge advancement purposes within the university research environment. We synthesized a set of 23 recommended access to data principles from recommendations set forth in the literature. An on-line questionnaire allowed us to gain sufficient information to determine whether recommended principles were adhered to in the acquisition of each specific dataset and whether scientists were successful in their use of each dataset. In this paper we present the results of the research for the principles "Adherence to Marginal Cost or Less" and "Adherence to Metadata Availability".
 


Introduction

Data availability is a key issue affecting society's social well-being. With widespread availability of information on the Internet and other media, abundant opportunities have arisen to search for scientific and technical gold in the ore of factual elements. The possibilities for discovery of new insights about the natural world with both commercial and public interest value are extraordinary (NRC 1999, 21-22). The academic community has taken advantage of the fast and inexpensive opportunities to share data and knowledge across digital networks.

The characteristics of digital data(sets) and collections of data (databases) that make them easy to share help to advance science, but may also provide disincentives for collecting data: "If [information] can be infinitely reproduced and instantaneously distributed all over the planet without cost, without our knowledge, without even its leaving our possession, how can we protect it?" (Barlow 1994, 85). The reverse question is raised by people on the other side of the access to data issue: if access to data is overly constrained through legal or technological methods, how can we realistically use the data in advancing the well-being of society?

Some foresee that current relatively open access to data for academia will continue to exist and expand because "information wants to be free" (Stewart Brand's slogan cited in: Barlow 1994, 89). Others contend that the real future of the information age lies "in metering every drop of knowledge and charging for every sip" (Okerson 1996, 80). Most suggest models that balance between the two extremes (see e.g. Varian 1995, 201, Reichman and Samuelson 1997).

Pressure by the private sector to shift the legal balance by increasing the protection for databases through legislation (e.g. HR 354) and self-help measures (contracts, licensing and technological methods for limiting access) is threatening the ability of the scientific community to access data. Pressure by some local governments towards revenue generation from sales of data (NRC 1997, 6, Reichman and Samuelson 1997, 68), private funding of academic research (Nelkin 1984, 97, NRC 1997, 111, 132) and pressure by university administrators to generate royalties from the products of faculty (Reichman and Samuelson 1997, 68) are other developments decreasing or threatening to decrease access to data for academics using geographic scientific and technical data.

However, empirical data about academic access to scientific and technical data are scant. We have little empirical evidence validating the extent to which various access policy environments do or do not contribute to the satisfaction of academic researchers or to the accomplishment of their project goals. Economic and legal scholars have argued that the current, relatively open, access to data environment in the United States is beneficial to advancing knowledge and the economy. Lopez, for example, found evidence that "U.S. academic sector players significantly benefit from the [open] dissemination policy of the U.S. federal government" (Lopez 1996, 210). The main objective of our research was to find evidence for or against these broad conventions in the context of access to and use of geographic data for knowledge advancement purposes within the university research environment.

Research Method

Figure 1: Overview of Research Method
First we synthesized a set of recommended access to data principles from recommendations set forth in various study reports issued by the National Research Council or recommended in the academic literature that relate to policies for providing access to scientific and technical data. For each project, regardless of whether these specific principles were adhered to, we assessed the scientists' satisfaction with the principles actually followed in gaining access to specific datasets and whether their goals were achieved. We hypothesized that geographic data sharing relationships are more productive for science if the recommended principles are followed. Further, we developed an on-line questionnaire to gain sufficient information to determine whether recommended principles were adhered to in the acquisition of each specific dataset and whether scientists were successful in their use of each dataset. The survey asked participants to share their experiences in accessing, using and disseminating datasets used in their research projects. In order to be able to compare the various policy environments researchers are confronted with, we asked for several measures of productivity:

    1. factors of successful use of the dataset,
    2. impediments in the use of the dataset,
    3. task accomplishment of the dataset,
    4. satisfaction with the dataset, and
    5. contribution of the dataset to overall research objective accomplishment.

The responses were evaluated statistically. We tested whether datasets adhering to an open access policy do or do not contribute to a more productive research environment than datasets not adhering to an open access policy. For the latter three productivity measures we asked the respondents to assess their productivity on a five-level scale (e.g. excellent, good, fair, poor, and non-existent). We used a t-test to test for statistical significance in these three measures of productivity. A t-test may be used to test a hypothesis stating that the mean scores on some variable will be significantly different for two independent samples or groups (Zikmund 1991, 504).
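The comparison of two groups on a five-level scale can be illustrated with a minimal sketch. It assumes a numeric coding of the scale from 0 (non-existent) to 4 (excellent) and a pooled-variance t statistic; neither the coding nor the t-test variant is specified in the paper, and the ratings below are invented for illustration only.

```python
from math import sqrt
from statistics import mean, variance

# Assumed numeric coding of the five-level scale (not given in the paper).
SCALE = {"non-existent": 0, "poor": 1, "fair": 2, "good": 3, "excellent": 4}


def pooled_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for two independent samples,
    using the pooled-variance (equal variances assumed) form of the t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical satisfaction ratings, one per dataset (illustration only).
adhering = [SCALE[r] for r in ["excellent", "good", "good", "fair", "excellent", "good"]]
not_adhering = [SCALE[r] for r in ["fair", "poor", "good", "fair", "poor", "fair"]]

t_stat, df = pooled_t(adhering, not_adhering)
print(f"t = {t_stat:.2f}, df = {df}")
```

The resulting t value is then compared with the critical value of the t distribution for the chosen level of significance and the given degrees of freedom. Where SciPy is available, `scipy.stats.ttest_ind` with `equal_var=True` computes the same statistic together with a p-value.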

Furthermore, an assessment was made in terms of success or impediments in the use of the dataset. We used the chi-square test to address this statistically. The chi-square distribution provides a means for testing the statistical significance of contingency tables. This allowed us to test for differences in two groups' distribution across categories (Zikmund 1991, 500).
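A chi-square test on a contingency table can be sketched as follows. The counts are hypothetical and serve only to show the computation; the function computes the plain Pearson statistic without a continuity correction, which is an assumption, since the paper does not state which variant was applied.

```python
def chi_square(table):
    """Return (Pearson chi-square statistic, degrees of freedom)
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group membership and answer were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts (illustration only). Rows: adhering / not adhering
# datasets. Columns: factor cited / factor not cited by the respondent.
counts = [
    [30, 20],
    [10, 40],
]

statistic, df = chi_square(counts)
print(f"chi-square = {statistic:.2f}, df = {df}")

# Critical value of the chi-square distribution for df = 1 at the 0.01 level.
CRITICAL_0_01_DF_1 = 6.635
print("groups differ at the 0.01 level:", statistic > CRITICAL_0_01_DF_1)
```

With one degree of freedom, a statistic above 6.635 indicates that the two groups are distributed differently across the categories at the 0.01 level of significance.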

Sampling group

The sample we strove for was members of the academic community who are employed by a university, either public or private, and who are conducting academic research using digital geographic data or a GIS in their work.

Our sample of researchers using geographic information was developed and drawn from three sources. The first group consisted of 619 academics listed as having interests in GIS on the web site of the University Consortium for Geographic Information Science (UCGIS). UCGIS is a non-profit organization of universities and other research institutions dedicated to advancing understanding of geographic processes and spatial relationships through improved theory, methods, technology, and data (http://www.ucgis.org). The second group consisted of 33 additional academics drawn from a URISA list of individuals with interests in geographic information science. URISA is a non-profit international association of information professionals with specific emphasis on applications in state and local government (http://www.urisa.org). The third group consisted of 53 academic researchers with National Science Foundation (NSF) support who indicated an intent to use a GIS in their research work. These individuals were identified through keyword searches of the NSF website (http://www.nsf.gov). Only those researchers were selected whose research proposal was accepted in 1994 or more recently.

The total sampling group consisted of 705 academics using geographic data in their work.

The survey

Of the 705 people invited, 305 responded to the invitation to participate. 148 respondents (21% of 705) provided useful responses for this research. The other 157 respondents indicated that they did not have time to fill out the questionnaire, were not conducting academic research, or did not use geographic information (systems) in their research.
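The response figures can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the variable names are chosen for illustration.

```python
# Counts as reported in the text.
sample_sources = {"UCGIS": 619, "URISA": 33, "NSF": 53}
invited = sum(sample_sources.values())
responded = 305
useful = 148
declined_or_ineligible = 157

# The useful and non-useful responses should account for all respondents.
assert useful + declined_or_ineligible == responded

overall_response_rate = responded / invited
useful_response_rate = useful / invited
print(f"invited: {invited}")
print(f"overall response rate: {overall_response_rate:.0%}")
print(f"useful response rate: {useful_response_rate:.0%}")
```

The three sources add up to the reported sampling group of 705, and the share of useful responses rounds to the 21% stated above.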

Sources of Datasets in Questionnaire
Table 1: Datasets Addressed in Questionnaire per Source

Disciplines of Respondents

Because a broad spectrum of disciplines use geographic data in scientific research, one would suspect that the data provided by our sample may be indicative of the responses across many research domains due to the cross-disciplinary nature of our sample. The largest group of respondents indicated that they work in the field of GIS, Surveying, Remote Sensing, or Photogrammetry (together 19%). Other major fields were Geography (15%), Ecological research (10%), Earth Sciences (9%) and Planning (9%).

Distribution of Datasets

The on-line questionnaire asked participants to fill out each question for at least one and at most three specific datasets. The questionnaire was filled out for 290 datasets. Table 1 shows the distribution of the sources of the datasets. The majority of the datasets (75%) in this research came from a public source. Here we present the results of the analyses of the principles "Adherence to Marginal Cost or Less" and "Adherence to Metadata Availability" for geographic data obtained from federal, state and local government. For each of the two principles we first present the principle itself, followed by the statistical analysis as it applies to government data.

Principle: "Adherence to Marginal Cost or Less"

Scientific and technical data collected or maintained by or under authority of a government agency should be made available to all requesters at the marginal cost of dissemination or less.

One of the most prevalent access issues has been the pricing of public data. The value of (geographic) data comes from its use and restricting access to data by asking high (market) prices does not promote the use of data. It is assumed that the higher the price of the data, the less it will be used, and the less the value of the dataset in respect of advancing knowledge. Throughout the nineties many discussions have evolved around this issue (for a complete discussion see Onsrud 1992 a & b). In the context of this principle an open access policy typically would be a policy allowing access to data at marginal cost or less. We considered access to be at marginal cost or less when the data is free, the cost of dissemination is charged or a minimal statutory fee is charged by the supplier of the dataset.

A measure of adherence to marginal costs was established through an analysis of the question: What did you pay for the data? The highest ranking of adherence to marginal cost or less would be one with the following responses: no costs, cost of dissemination, or a statutory fee. The lowest ranking would be the responses market price, market price less a discount, full or partial cost recovery.
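The grouping of responses described above can be sketched as a simple lookup. The answer strings follow the wording in the text, but their exact form in the questionnaire is an assumption, as are the function and constant names.

```python
# Answer categories as named in the text; exact questionnaire wording is assumed.
MARGINAL_COST_OR_LESS = {"no costs", "cost of dissemination", "statutory fee"}
ABOVE_MARGINAL_COST = {
    "market price",
    "market price less a discount",
    "full or partial cost recovery",
}


def adherence_to_marginal_cost(answer):
    """Classify an answer to 'What did you pay for the data?'.

    Returns 'highest' or 'lowest' adherence, or None when the answer
    falls outside the categories named in the text."""
    normalized = answer.strip().lower()
    if normalized in MARGINAL_COST_OR_LESS:
        return "highest"
    if normalized in ABOVE_MARGINAL_COST:
        return "lowest"
    return None


# Hypothetical answers (illustration only).
answers = ["No costs", "Market price", "Statutory fee", "Do not know"]
for answer in answers:
    print(f"{answer!r}: {adherence_to_marginal_cost(answer)}")
```

Answers that match neither group are returned as `None` so that they can be left out of the comparison rather than being forced into one of the two groups.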

Price of a Dataset
Table 2: Price Respondents Paid for Data (table includes all sources i.e. government sources, private sources, non-profit sources and other sources)

171 of the government datasets qualified for the highest level of adherence and 23 for the lowest.

We statistically tested whether datasets adhering to the principle did or did not contribute to a more productive research environment than datasets not adhering to the principle. The t-test provided conflicting results for the different measures of productivity. Respondents who acquired datasets at high costs were able to perform significantly more tasks with the dataset than respondents who accessed their datasets for marginal costs or less (at a level of significance of 0.10). However, respondents using "inexpensive" datasets were significantly more satisfied (at a 0.20 level of significance) and accomplished significantly more overall objectives (at a level of significance of 0.10). One possible explanation is that these respondents could use the funds initially meant for the acquisition of datasets for other elements important to the research project. Another explanation may lie in the respondents' expectations of the dataset. The (potential) user may have no or very low expectations of inexpensive datasets: any contribution to the research may satisfy the researcher. Expectations of more expensive datasets, on the other hand, may be higher: users expect such datasets to contribute to the research, and if the contribution is absent or smaller than expected, satisfaction with the dataset may diminish.

We also performed a chi-square test. We asked respondents to indicate which factors contributed to successful use of their dataset. As the measure of success for this principle we used the answer "the cost of the dataset". We found that the two groups differ significantly in their distribution (at a level of significance of 0.01). Respondents with datasets available at marginal cost or less indicated for 45% of the datasets that the cost of the dataset contributed to successful use of that dataset. For 0% of these datasets the cost was considered an impediment to use. The group with datasets available at high cost scored 52% for successful use and 9% for impediments. The chi-square test thus also suggests that the price of a dataset does not necessarily affect the productivity of the academic researcher. However, the measure of success in the chi-square test focused on successful use of the dataset. The issue of money may not influence the use of the dataset since one first acquires and then uses the data.

The conflicting statistical information and the lack of background information that might have allowed us to explain the results better force us to conclude that, although the research suggests that the price of a dataset does not affect the productivity of the academic researcher, further research, probably through alternative research methods (interview/case-study research), is needed to explore the proposition further.

In our study we see that in most instances the respondents did access the data at low cost (see table 2). In the table we see that, according to our definition of access at marginal costs, one of the threats we mentioned in the introduction, revenue generation from the sales of data by local governments, does not seem to be put into practice (yet): 87% of the datasets coming from local government were obtained at marginal cost or less.

Price of the Dataset per Source
Table 3: Cost of the Dataset per Source (in percentages)

 

Principle: "Adherence to Metadata Availability"

Scientific and technical data collected or maintained by or under authority of a government agency should be documented adequately with metadata.

New technology is significant in that it creates an opportunity for people to access information previously unavailable. However, one needs to use the technology efficiently and effectively in order to take advantage of the opportunity. In order to "disseminate public information in an efficient, effective, and economical manner" (PRA 1995 (1) (C)) sufficient and appropriate hard- and software programs, standards to communicate between agencies and between agencies and requesters of data, and adequate documentation (metadata) to guarantee the quality of the dataset are required.

Metadata is data about data, such as where is it located, how is it collected and maintained and by whom, how can it be accessed and the characteristics of the data itself (McLaughlin and Nichols 1994, 71). The major uses of metadata are: to help organize and maintain an organization's internal investment in spatial data, to provide information about an organization's data holdings to data catalogues, clearinghouses, and brokerages, and to provide information to process and interpret data received through a transfer from an external source (FGDC 1997).

Adequate explanatory documentation or metadata can eliminate great barriers in the usage of scientific and technical data. It is one of the key components in the FGDC strategy to develop the National Spatial Data Infrastructure: "If you think the cost of metadata production is too high - you haven't compiled the costs of not creating metadata: loss of information with staff changes, data redundancy, data conflicts, liability, misapplications, and decisions based upon poorly documented data" (FGDC, 2000).
However, operational controllers may regard the additional costs of cleaning up and documenting the information they collect so that it can be shared with others as outweighing the benefits to be obtained by gaining access to other data sets (Masser and Ottens, 1999, 37). Harvey, for example, found that local government suppliers of geo-data do not always recognize the documentation of metadata as being of significant importance (Harvey 2001, 37).

In our research we established a measure of availability of adequate metadata through an analysis of the following question: How good was the documentation of the dataset? The highest ranking corresponds to the answers good or excellent documentation. 109 datasets qualified for this highest level of adherence. The lowest ranking corresponds to the answers fair, poor or non-existent documentation. 84 datasets qualified for this lowest level of adherence. In total, 193 responses were analyzed.

Quality of Documentation
Table 4: Quality of the Documentation per Source

The t-test showed that datasets with adequate documentation are more productive than datasets with inadequate documentation on two measures of productivity, task accomplishment and satisfaction (at a 0.001 level of significance). Datasets with adequate documentation also allow significantly more overall objectives to be accomplished than datasets with inadequate documentation (at a level of significance of 0.01). Thus, there is a strong indication that datasets with adequate documentation are more productive for academic researchers than datasets with inadequate documentation.

We also tested the principle with a chi-square test. As the measure of success we used the answer "adequate documentation or metadata for this dataset" to the question "Which of the following, if any, were significant factors in allowing you to successfully use this dataset?". As the measure of impediment we used the answer "inadequate documentation or metadata for this dataset" to the question "Which of the following, if any, were significant impediments to your use of this dataset?".
The two groups differ significantly (at a 0.001 level of significance). The group with adequate documentation scored 51% on the success measure, against only 7% of the datasets in the other group. The datasets with adequate documentation also scored better on the impediments measure: 9% versus 32%. The chi-square test confirmed the findings of the t-test, suggesting that the availability of adequate documentation allows significantly more successful use of a dataset than is possible with datasets lacking adequate documentation.

One may wonder what adequate documentation is. The responses to the question "Which of the following did the documentation of the dataset (digital catalogue files or metadata) help you accomplish?" provided us with background information on the documentation of a dataset. We used as a test for the sufficiency of metadata a positive response that at least three of the following features were addressed in the documentation of the data:

(1) technical suitability of the dataset,
(2) quality/ accuracy of the dataset,
(3) timeliness of the data,
(4) relevance of the dataset,
(5) contractual restrictions or other legal constraints to the use of the datasets, or
(6) allows users to find the dataset through a computer search.
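The sufficiency test described above (at least three of the six features addressed) can be sketched as a set operation. The short feature labels and the function name are hypothetical, chosen for illustration only.

```python
# Short labels for the six features listed above (labels are assumptions).
METADATA_FEATURES = frozenset({
    "technical suitability",
    "quality/accuracy",
    "timeliness",
    "relevance",
    "legal constraints",
    "findable through computer search",
})

# Threshold stated in the text: at least three features addressed.
MINIMUM_FEATURES = 3


def metadata_is_sufficient(features_addressed, minimum=MINIMUM_FEATURES):
    """Return True when the documentation addressed at least `minimum`
    of the six listed features. Unknown labels and duplicates are ignored."""
    recognized = set(features_addressed) & METADATA_FEATURES
    return len(recognized) >= minimum


# Hypothetical responses for two datasets (illustration only).
well_documented = ["quality/accuracy", "timeliness", "relevance", "legal constraints"]
poorly_documented = ["relevance", "relevance", "file format"]

print(metadata_is_sufficient(well_documented))
print(metadata_is_sufficient(poorly_documented))
```

Using a set means that a feature ticked twice is counted once, and that answers outside the six listed features do not contribute to the threshold.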

This research provided evidence that academic users of government data highly value the existence of metadata. Moreover, the research showed that the productivity of the academic researcher with a particular dataset, in this research measured in task accomplishment with the dataset, satisfaction with the dataset and overall objective accomplishment with the dataset, is positively correlated to the existence of metadata. 

One way of guaranteeing the documentation of metadata is to require and fund metadata creation and appropriate archiving of research datasets in public depositories or libraries as standard conditions of grants.

Conclusions

This research explored current access policies imposed on researchers in U.S. universities that affect geographic scientific and technical data. Because a broad spectrum of disciplines use geographic data in scientific research, we suspect that the data provided by our sample may be indicative of the responses across many research domains due to the cross disciplinary nature of our sample.

Although the research suggests that the price of a dataset does not affect the productivity of the academic researcher, further research, probably through alternative research methods (interview/case-study research), is needed to explore this proposition further.

The study evidenced that in order to advance the progress of science, government agencies supplying geographic data should document their data adequately with metadata. However, determining the specific utility of metadata and which constituent components are most critical would require further investigation. 

Acknowledgments

The author acknowledges the financial support of the U.S. National Science Foundation (SBR-9700465), and the VSB-foundation in the Netherlands (http://www.vsbfonds.nl).

References

Barlow, John Perry, (1994), The Economy of Ideas, A framework for patents and copyrights in the digital age. (Everything you know about intellectual property is wrong.), WIRED 2.03, March, 84-90, 126-129.

FGDC, (1997), Geospatial metadata.

FGDC, (2000), Ten most common metadata errors, FGDC Metadata Education Program, September.

Harvey, Francis, (2001), U.S. National Spatial Data Infrastructure, the Local Government Perspective, GIM International, March, 36-39.

Loenen, B. van, (2001), Access to Scientific and Technical Data in an Academic Setting, M.Sc. thesis, University of Maine, May.

Lopez, Xavier, (1996), The Impact of Government Information Policy on the Dissemination of Spatial Data, A Thesis Submitted in the Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (in Spatial Information Science and Engineering), University of Maine, August.

Masser, Ian and Henk Ottens, (1999), Urban Planning and Geographic Information Systems, 25-42. Published in John Stillwell et al (eds.), Geographical Information and Planning, Springer-Verlag.

McLaughlin, John and Sue Nichols, (1994), Developing a National Spatial Data Infrastructure, Journal of Surveying Engineering, Vol. 120, No. 2 May 1994, 64-76.

Nelkin, Dorothy, (1984), Science as Intellectual Property: Who Controls Scientific Research? AAAS series on Issues in Science and Technology. MacMillan Publishing Company, New York.

NRC, National Research Council, (1997), Committee on Issues in the Transborder Flow of Scientific Data, U.S. National Committee for CODATA, Commission on Physical Sciences, Mathematics, and Applications, National Research Council, Bits of Power: Issues in Global Access to Scientific Data, National Academy Press, Washington, D.C. http://www.nap.edu/readingroom/books/BitsOfPower/index.html.

NRC, National Research Council, (1999), Mapping Science Committee, Distributed Geolibraries, Spatial Information Resources, National Academy of Sciences, http://www.nap.edu/books/0309065402/html/

Okerson, Ann, (1996), Who Owns Digital Works? Computer Networks Challenge Copyright Law, But Some Proposed Cures May be as Bad as the Disease, Scientific American, July, pp.80-84.

Onsrud, H.J., (1992), In Support of Cost Recovery for Publicly Held Geographic Information, GIS Law, 1(2): 1-7, http://www.spatial.maine.edu/~onsrud/pubs/Cost_Recovery_for_GIS.html

Onsrud, H.J., (1992), In Support of Open Access for Publicly Held Geographic Information, GIS Law, 1(1): 3-6, http://www.spatial.maine.edu/~onsrud/pubs/In_Support_OA.htm

PRA, (1995), Paperwork Reduction Act, (1) (C).

Reichman, Jerome H. and Pamela Samuelson, (1997), Intellectual Property Rights in Data, Vanderbilt Law Review Vol. 50:51, 51-166.

Varian, Hal R., (1995), The Information Economy. How much will two bits be worth in the digital market place? Scientific American, September, 200-201.

Zikmund, William G., (1991), Business Research Methods, Chicago: Dryden Press, 3rd Edition