Abstract
Widespread use of Web and Distributed Geographic Information Systems (WGIS/DGIS) has facilitated the sharing of spatial databases among organizations and users over the Internet. However, interoperability among these systems and applications remained a problem until the recent publication of the Geography Markup Language (GML) standard by OGC. GML is a new implementation specification that offers a vendor-neutral framework for defining geo-spatial application schemas and objects as eXtensible Markup Language (XML) documents. XML is quickly becoming a de facto standard for electronic data exchange among Web applications, and we expect the same trend with GML for WGIS applications. However, conventional query languages in their current form are not suitable for direct query and update of XML repositories. The recently published XQuery standard by W3C offers a powerful standard query language for XML, but it lacks support for spatial queries. In this study we extended the XQuery language to support spatial queries over GML documents. We describe the resulting language, GML-QL, in detail and provide representative spatial query examples.
Key Words: XML Query languages, GML, Spatial Query Languages
The eXtensible Markup Language (XML) has become a widely adopted standard for representing structured and semi-structured data in scientific and business domains. XML provides a flexible, standard data exchange format for the Web. Its attractiveness as an exchange format comes from the fact that it lets users define tags that mark up a document according to its real structure; in other words, XML supports schemas. Document Type Definition (DTD) is the simplest schema language in wide use. The core of a DTD consists of definitions for elements and attributes. XML Schema [25] is a newer recommendation of the World Wide Web Consortium (W3C) that offers more expressive power than DTD and is expected to eventually replace it. The purpose of a schema is to define a class of XML documents that conform to a particular structure.
The Geography Markup Language (GML) [17] is one such example. GML is an XML encoding for the transport and storage of geographic information, including both the spatial and non-spatial properties of geographic features. The first version of GML was based on DTD. The current standard is based on XML Schema and consists of schema definitions for geometry, features, and XLink. Although GML solves the basic problem of data exchange and integration over the Internet, it brings into focus several new issues that need to be solved before the technology becomes practical: how to specify queries over GML documents, how to represent the data efficiently (native representation vs. ORDBMS), how to transform queries (wrapping, views, etc.), and how to specify and enforce integrity constraints. The first requirement is a suitable language for specifying spatial queries over GML documents. In this project we have studied several XML query languages and identified the key language constructs needed for specifying spatial queries.
A previous study by Shekhar et al. [21] focused on the integration of GML into an interoperable WebGIS and on client-side visualization (GMLView). It also brought into focus the need for spatial indexing schemes and provided algebraic cost models for query processing. Recently Corcoles et al. [7] proposed a spatial query language over GML using extended SQL syntax. Such an approach requires implementation from scratch and does not align with the current XML query standard. Direct extension of SQL is a good alternative; however, efforts are still underway to define an SQL/XML [9] standard by the InterNational Committee for Information Technology Standards (INCITS) and International Organization for Standardization (ISO) database committees. Thus our primary objectives in this study are to: i) evaluate existing XML query languages and their usefulness for spatial querying, and ii) extend one of the existing (and most suitable) native XML query languages for spatial (GML) databases.
The rest of this paper is organized as follows. In Section 2 we review existing XML query languages and describe XQuery, the basis of GML-QL, in detail. In Section 3 we identify the key language constructs required for spatial queries and provide sample spatial queries using GML-QL. Finally, we provide conclusions and future research directions in Section 4. In the Appendix we provide the schema for a sample GML dataset.
We start with a brief literature survey and identify the language constructs needed for specifying spatial queries over GML documents. The definition of XML query languages began only recently, with a W3C initiative in 1998. That year, W3C organized a workshop (QL'98 [22]) whose aim was to identify the problems and opportunities involved in creating or adapting a query language capable of handling XML data. The workshop attracted over 66 position papers from a diverse group of researchers and industries. The resulting proposals can be broadly grouped into three approaches: XSL extensions, XQL (XML Query Languages), and a mix of SQL/OQL dialects. We studied the XML-QL [8] and XQL [18] proposals in detail, as they have received a lot of attention and generated considerable research interest in the last two years. An XML Query Language (XQuery) working draft was published recently [23], which describes a new query language called XQuery. Understanding this specification is important, as it provides the basis for future XML query processing systems. We specifically looked into the Quilt [6] language, as it is the key resource behind the current XQuery standard. A recent issue of ACM SIGMOD Record [1] contains a special section on advanced XML data processing. This issue gives a snapshot of the current state of the art in XML data processing, including querying, benchmarking, and versioning, and brings into focus various research challenges. Key insights into the query algebra and its semantics for XQuery are provided in [11]. A general technique for querying XML documents using a relational database system is given in [19].
With the large amounts of data contained in GML documents, data storage management is of primary importance for efficient query processing. There are two basic approaches to this problem. One is to develop a native XML database system (like Lore [13] or Niagara [15]), and the other is to extend relational and object-relational database systems ([19], [2]). Both approaches have advantages and disadvantages. One of the reasons that prompted us to choose native XML query languages is that, unlike commercial applications (e.g., e-commerce), Web-based geographic information systems (WebGISes) are not based on RDBMSes.
We have evaluated XML-QL [8], XQL [18], Quilt [6] and XQuery [23] languages for their usefulness in specifying spatial queries. These languages offer powerful tag-based document structure manipulation and join mechanisms, but provide very limited content processing capabilities. However, GML documents contain complex elements like spatial (geometry) and non-spatial attributes and require additional constructs (either built-in or user defined) in the query language to process and interpret these elements. In addition, the conventional data models and XML data models differ significantly. Tree or event models are often used to process/parse XML documents. In the following section we summarize the current state of the art in XML Query languages.
2.1 Comparison of XML Query Languages
Before proposing any new language for GML, it is necessary to evaluate and compare existing XML query languages. Bonifati and Ceri [4] have done a comparative analysis of five XML query languages: LOREL, XML-QL, XML-GL, XSL, and XQL. The comparison was based on an exhaustive set of desired language features such as data models, query abstractions, path expressions, quantification, negation, reduction, restructuring, aggregation, nesting, set operations, order management, typing and extensibility, support for advanced XML features, and update capabilities (insert, delete, and update). This analysis shows that LOREL and XML-QL are powerful languages. A recent survey by Bonifati and Lee [5] summarizes several schema and query languages. That study includes XQuery in addition to the other five languages. XQuery is also the standard proposed by W3C and is mostly drawn from Quilt. Quilt itself is a unification of concepts from several languages and specifications, including XPath, XQL, XML-QL, SQL, and OQL. Our study also shows that XQuery is the most comprehensive of all the languages listed above. We chose XQuery because: i) it is more powerful than the other languages, ii) it offers the flexibility to specify complex queries (involving various types of joins), iii) it can be extended (through user-defined functions), and, last but not least, iv) it is the current standard.
Apart from the rich set of features explained above, these query languages should also have other desirable qualities. There is no universal standard for defining the desired qualities; often they are subjective. We chose the same set of desirable qualities used in [4,5], namely declarativeness, expressive power, and ease of use. Our evaluation also shows that XQuery is comparable to or better than other XML query languages in terms of these qualities. We now briefly describe the XQuery language and provide sample non-spatial queries over a sample GML dataset.
XQuery is a functional language: a query is simply composed of expressions. There are seven types of expressions in the XQuery language: path expressions, FLWR expressions, element constructors, conditional expressions, quantified expressions, expressions involving functions and operators, and expressions for testing and modifying datatypes. The core of the XQuery language is the FLWR expression, which is constructed from FOR, LET, WHERE, and RETURN clauses. The FLWR expression forms the skeleton of a query statement and is analogous to the SELECT-FROM-WHERE statement in SQL.
FOR: The FOR clause uses XPath expressions to bind the values
of one or more variables. Each of these expressions returns a set of nodes,
and the FOR clause generates an ordered list of tuples, each containing
a value for each of the bound variables. The order is determined by the
order of the bound elements in the input document. The first bound variable
takes precedence, followed by the second bound variable, and so on. Preserving
the same order as the original document is important in many applications.
However, if any of the expressions used in a FOR-clause is unordered, then
the tuples generated by the FOR/LET sequence are unordered. When a node
is bound to a variable, its descendant nodes are carried along with it.
For example, the following query returns all schools from schools.xml.
FOR $s IN document("schools.xml")//School
RETURN $s
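To make the FOR semantics concrete, the following is a minimal Python sketch of the same iteration using the standard library's ElementTree. It is not an XQuery engine; the inline sample document and its element names (`Schools`, `name`, `pupils`) are assumptions made for illustration, since the layout of schools.xml is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for schools.xml; element names are assumptions.
SCHOOLS_XML = """<Schools>
  <SchoolDistrict><name>North</name>
    <School><name>Lincoln</name><pupils>400</pupils></School>
    <School><name>Adams</name><pupils>300</pupils></School>
  </SchoolDistrict>
  <SchoolDistrict><name>South</name>
    <School><name>Grant</name><pupils>500</pupils></School>
  </SchoolDistrict>
</Schools>"""


def all_schools(xml_text):
    """Rough analogue of: FOR $s IN document(...)//School RETURN $s."""
    root = ET.fromstring(xml_text)
    # iter() visits matching descendants in document order, like the // step.
    return list(root.iter("School"))


schools = all_schools(SCHOOLS_XML)
# Each bound node carries its descendants along, so child values are reachable.
names = [s.findtext("name") for s in schools]
first_pupils = schools[0].findtext("pupils")
print(names, first_pupils)
```

The list preserves the order of the input document, which mirrors the ordering guarantee described for the FOR clause.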
LET: The initial FOR-clause in a FLWR expression can be followed by one or more LET-clauses and additional FOR-clauses, which provide bindings for additional variables. The LET clause binds a variable to the value of an expression. The main difference between a FOR-clause and a LET-clause is that each variable bound by a FOR-clause represents a single node (along with its descendants), whereas a variable bound by a LET-clause may represent a collection of nodes (i.e., an ordered forest). A LET-clause is often used to hold a set of values that is then passed to an aggregate function (e.g., avg()). The following query generates the average number of students in a school.
LET $s := document("schools.xml")//School
RETURN <avgstudents> avg($s/pupils) </avgstudents>
When we want to apply some aggregate function on a set of groups, then the FOR and LET-clauses can be combined to generate those groups. For example, the following query generates the average number of students in each school district.
FOR $sd IN document("schools.xml")//SchoolDistrict
LET $pd := avg(document("schools.xml")[SchoolDistrict = $sd]/School/pupils)
RETURN <SchoolDistrict> <gml:name> $sd/text() </gml:name>
<AvgStudents> $pd </AvgStudents>
</SchoolDistrict>
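The interplay of FOR (one binding per group) and LET (the collection belonging to that group) can be sketched in Python as follows. This is only an illustration under assumptions: the sample document nests `School` elements inside `SchoolDistrict`, and the output element `Result` is a hypothetical wrapper that does not appear in the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for schools.xml; the nesting is an assumption.
SCHOOLS_XML = """<Schools>
  <SchoolDistrict><name>North</name>
    <School><pupils>400</pupils></School>
    <School><pupils>300</pupils></School>
  </SchoolDistrict>
  <SchoolDistrict><name>South</name>
    <School><pupils>500</pupils></School>
  </SchoolDistrict>
</Schools>"""


def average_by_district(xml_text):
    """Build one output element per district with its average pupil count."""
    root = ET.fromstring(xml_text)
    result = ET.Element("Result")  # hypothetical wrapper element
    for sd in root.iter("SchoolDistrict"):  # FOR: one binding per district
        # LET: the whole collection of pupil counts for this district
        pupils = [int(p.text) for p in sd.findall("School/pupils")]
        out = ET.SubElement(result, "SchoolDistrict")  # RETURN: constructor
        ET.SubElement(out, "name").text = sd.findtext("name")
        ET.SubElement(out, "AvgStudents").text = str(sum(pupils) / len(pupils))
    return result


result = average_by_district(SCHOOLS_XML)
pairs = [(d.findtext("name"), d.findtext("AvgStudents")) for d in result]
print(pairs)
```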
WHERE: The WHERE clause is used to further filter the tuples
generated by the FOR and LET clauses. Only tuples that satisfy all the conditions
in the WHERE clause are available to the RETURN clause. For example,
the following query generates those school districts where the number of
schools is greater than 5.
FOR $sd IN document("schools.xml")//SchoolDistrict
LET $pd := document("schools.xml")[SchoolDistrict = $sd]/School
WHERE count($pd) > 5
RETURN <SchoolDistrict> <gml:name> $sd/text() </gml:name> </SchoolDistrict>
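The filtering role of WHERE can be illustrated with a short Python sketch. The function name `districts_with_more_schools_than` and the sample document are hypothetical; the threshold is a parameter so that the small sample can exercise the filter, with the default of 5 matching the query above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for schools.xml; the nesting is an assumption.
SCHOOLS_XML = """<Schools>
  <SchoolDistrict><name>North</name>
    <School/><School/><School/>
  </SchoolDistrict>
  <SchoolDistrict><name>South</name>
    <School/>
  </SchoolDistrict>
</Schools>"""


def districts_with_more_schools_than(xml_text, threshold=5):
    """Return names of districts whose school count exceeds the threshold."""
    root = ET.fromstring(xml_text)
    selected = []
    for sd in root.iter("SchoolDistrict"):        # FOR
        schools = sd.findall("School")            # LET
        if len(schools) > threshold:              # WHERE
            selected.append(sd.findtext("name"))  # RETURN
    return selected


more_than_two = districts_with_more_schools_than(SCHOOLS_XML, threshold=2)
more_than_five = districts_with_more_schools_than(SCHOOLS_XML)
print(more_than_two, more_than_five)
```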
RETURN: The result set is constructed in the RETURN clause. The result could be a primitive value, a node, or an ordered forest of nodes. The basic syntax of the XQuery language is as follows:
FOR for-expression (like XPath expression)
LET let-expression
[[FOR for-expression] [LET let-expression] ..]
WHERE predicate [[AND | OR | NOT] predicate]
RETURN element constructors (using the result of the FLW-expression)
The for-expression in FOR-clause is of the form <Variable> 'IN' 'DISTINCT'?
expression, where expression is a constant, or a variable, or an element
constructor, or a function, or an XPathExpression, or another FLWR-expression.
Thus we can construct sub-queries using nested FLWR-expressions. XPath
Expressions are based on XPath [24]
abbreviated syntax, which is briefly explained below.
| . | Denotes the current node |
| .. | Denotes the parent of the current node |
| / | Denotes the root node, or children of the current node |
| // | Denotes descendants of the current node |
| @ | Denotes attributes of the current node |
| * | Denotes "any" (node with unrestricted name) |
| [] | Encloses a Boolean expression that serves as a predicate for a given step |
| [n] | When a predicate consists of an integer, it serves to select the element with the given ordinal number from a list of elements |
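Most of these abbreviations can be tried out with the limited XPath subset implemented by Python's ElementTree. The sketch below is an illustration only: the sample document and its `id` attribute are assumptions, and ElementTree accepts `@` only inside a predicate rather than as a step that selects attribute nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; names and attributes are assumptions.
DOC = """<Schools>
  <SchoolDistrict id="d1">
    <School><name>Lincoln</name></School>
    <School><name>Adams</name></School>
  </SchoolDistrict>
  <SchoolDistrict id="d2">
    <School><name>Grant</name></School>
  </SchoolDistrict>
</Schools>"""

root = ET.fromstring(DOC)

# "." and "*": any child of the current node
children = [e.tag for e in root.findall("./*")]
# "//": all descendants named "name"
descendants = [e.text for e in root.findall(".//name")]
# "[n]": ordinal position among siblings (1-based)
second = root.find("./SchoolDistrict[1]/School[2]/name").text
# "@" inside "[]": attribute test used as a predicate
by_attr = root.find("./SchoolDistrict[@id='d2']/School/name").text
# "..": step from a matched School up to its parent district
parents = [e.get("id") for e in root.findall(".//School[name='Grant']/..")]

print(children, descendants, second, by_attr, parents)
```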
Element constructors are used to generate element nodes. An element constructor consists of a start tag and an end tag, enclosing an optional list of expressions that provide the content of the element. The following example constructs a new school element.
<School>
<gml:name> $name </gml:name>
<address> $address </address>
<gml:location>
<gml:Point srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<gml:coord><gml:X>50.0</gml:X><gml:Y>30.0</gml:Y></gml:coord>
</gml:Point>
</gml:location>
</School>
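For comparison, the same element can be built programmatically. The following Python sketch uses ElementTree; the helper name `make_school` and its parameters are hypothetical, and the only namespace URI used is the GML one that appears elsewhere in this paper.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"
ET.register_namespace("gml", GML_NS)  # serialize with the gml: prefix


def gml(tag):
    """Qualified name for a tag in the GML namespace."""
    return "{%s}%s" % (GML_NS, tag)


def make_school(name, address, x, y,
                srs="http://www.opengis.net/gml/srs/epsg.xml#4326"):
    """Hypothetical helper mirroring the <School> element constructor."""
    school = ET.Element("School")
    ET.SubElement(school, gml("name")).text = name
    ET.SubElement(school, "address").text = address
    location = ET.SubElement(school, gml("location"))
    point = ET.SubElement(location, gml("Point"), srsName=srs)
    coord = ET.SubElement(point, gml("coord"))
    ET.SubElement(coord, gml("X")).text = str(x)
    ET.SubElement(coord, gml("Y")).text = str(y)
    return school


school = make_school("Lincoln", "12 Main St", 50.0, 30.0)
print(ET.tostring(school, encoding="unicode"))
```

The variables `$name` and `$address` in the constructor correspond to the function arguments here; in a query they would be bound by enclosing FOR or LET clauses.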
Expressions can be constructed using infix and prefix operators, and can be nested using parentheses. XQuery supports the usual set of arithmetic and logical operators, and the collection operators UNION, INTERSECT, and EXCEPT. Collection operators are applicable only to nodes. IF-THEN-ELSE conditional expressions are also supported. XQuery supports both inner and outer joins (left, right, full). An important and necessary feature of any query language is support for functions. XQuery supports an extensive set of built-in functions: the XQuery function library consists of the XPath functions, the aggregate functions of SQL (avg, sum, count, min, and max), and several utility functions such as distinct and empty. The present XQuery standard supports user-defined functions in a limited way. It is expected to provide an extensibility mechanism through which users can define functions in any programming language. These two features, built-in and user-defined functions, are essential requirements for supporting spatial queries.
3. GML-QL: A spatial query language
Spatial extensions to SQL have been studied extensively before. Spatial
SQL [10] presents an extension
of SQL for spatial query and visualization. GeoSQL [12]
is another extension for object-oriented GIS. Recently SQL/SDA [14]
presents an extension of SQL for spatial data analysis. Spatial queries
can be broadly classified into three categories [3]:
The basic syntax of GML-QL is the same as that of XQuery. The extension allows the spatial operations defined by OGC [16] to be used in the clauses of a FLWR expression as appropriate.
3.1 GML-QL examples
This section provides spatial query examples chosen from
[20] using SQL3 and GML-QL. These
queries are based on the assumption that the OGIS datatypes and operations
are available in SQL3, and can be incorporated into GML-QL as outlined
in the above section.
Example 1: List the name, population, and area of each country (Illustrates usage of selection and spatial computation).
FOR $c IN document("Country.xml")//country
RETURN <Country> <gml:name> $c/Name/text() </gml:name> <pop> $c/Pop </pop> <area> Area($c/Shape) </area> </Country>
The corresponding SQL query is given by:
SELECT C.Name, C.Pop, Area(C.Shape) AS "Area"
FROM Country C
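To suggest what a function such as Area might compute behind the scenes, here is a minimal Python sketch based on the shoelace formula. It is an assumption-laden illustration, not the OGC definition: it treats coordinates as planar, ignores the spatial reference system and interior rings, and reads a GML 2 style coordinate string of the form "x1,y1 x2,y2 ...". The function names are hypothetical.

```python
def parse_coordinates(text):
    """Parse 'x1,y1 x2,y2 ...' into a list of (x, y) float tuples."""
    return [tuple(float(v) for v in pair.split(",")) for pair in text.split()]


def ring_area(ring):
    """Planar area of a simple polygon ring via the shoelace formula."""
    total = 0.0
    # Pair each vertex with its successor, wrapping around to the first one.
    # If the ring repeats its first vertex at the end, the wrap adds zero.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# A 4 x 3 rectangle given as a closed ring (first point repeated at the end).
ring = parse_coordinates("0,0 4,0 4,3 0,3 0,0")
area = ring_area(ring)
print(area)
```

For geographic coordinates such as EPSG:4326, a real implementation would need to project the ring or use a geodesic area computation.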
FOR $r IN document("River.xml")//river,
$c IN document("Country.xml")//country
WHERE Cross($r/Shape, $c/Shape) = 1
RETURN <CountryRivers> <cname> $c/Name </cname> <rname> $r/Name </rname> <length> Length(Intersection($c/Shape, $r/Shape)) </length> </CountryRivers>
The corresponding SQL query is given by:
SELECT R.Name, C.Name, Length(Intersection(R.Shape, C.Shape)) AS "Length"
FROM River R, Country C
WHERE Cross(R.Shape, C.Shape) = 1
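The Cross predicate in this join can be approximated, for illustration, by testing whether any segment of the river's line string properly intersects any edge of the country's boundary ring. The Python sketch below is a simplified proxy under stated assumptions: planar coordinates, a single exterior ring, and no handling of touching or collinear cases, so it does not reproduce the full OGC semantics. All function names are hypothetical.

```python
def orient(a, b, c):
    """Sign (-1, 0, 1) of the cross product (b - a) x (c - a)."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p1, p2, q1, q2):
    """True if segments p1p2 and q1q2 intersect at one interior point."""
    return (orient(p1, p2, q1) * orient(p1, p2, q2) < 0 and
            orient(q1, q2, p1) * orient(q1, q2, p2) < 0)


def line_crosses_ring(line, ring):
    """True if any line segment properly crosses any boundary edge."""
    edges = list(zip(ring, ring[1:]))
    return any(segments_cross(a, b, c, d)
               for a, b in zip(line, line[1:])
               for c, d in edges)


# Hypothetical geometries: a square country and two rivers.
country = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
river_through = [(-1, 2), (5, 2)]
river_outside = [(5, 5), (6, 6)]

crosses_through = line_crosses_ring(river_through, country)
crosses_outside = line_crosses_ring(river_outside, country)
print(crosses_through, crosses_outside)
```

In a join like the one above, such a predicate would be evaluated for every (river, country) pair unless a spatial index prunes the candidates first.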
FOR $r IN document("River.xml")//river[Name = "St. Lawrence"],
$c IN document("City.xml")//city
WHERE Overlap(Buffer($r/Shape, 300), $c/Shape) = 1
RETURN <CityName> <cname> $c/Name </cname> </CityName>
The corresponding SQL query is given by:
SELECT Ci.Name
FROM City Ci, River R
WHERE Overlap(Ci.Shape, Buffer(R.Shape, 300)) = 1 AND R.Name = 'St. Lawrence'
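When the second geometry is a point, as with cities here, the combination of Buffer and Overlap reduces to a distance test: the point lies in the buffer exactly when its distance to the line string is at most the buffer radius. The Python sketch below illustrates this under the assumption of planar coordinates in the same units as the radius; the function names and sample geometries are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def within_buffer(point, line, radius):
    """True if the point is within `radius` of any segment of the line."""
    return any(point_segment_distance(point, a, b) <= radius
               for a, b in zip(line, line[1:]))


# Hypothetical river and cities in planar units.
river = [(0.0, 0.0), (1000.0, 0.0)]
near_city = (500.0, 200.0)
far_city = (500.0, 400.0)

near_ok = within_buffer(near_city, river, 300)
far_ok = within_buffer(far_city, river, 300)
print(near_ok, far_ok)
```

With latitude and longitude coordinates, the distance would have to be computed geodesically or after projection for the radius to be meaningful.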
4. Conclusions and Future Research Directions
In this study we have evaluated several XML query languages and identified the key features of these languages that are required for specifying spatial queries over GML documents. The native XML query languages are rich in tag-based manipulation of documents, but are weak in content manipulation. This study shows that non-spatial queries can be answered directly with XQuery, one of the most powerful XML query languages and the current standard. However, spatial queries require a rich set of topological predicates and spatial analysis functions. Thus we have extended XQuery to support spatial query operations. The resulting language, called GML-QL, supports the operations of the OGC Simple Features Specification for SQL. We have provided a sample set of complex spatial queries in SQL3/OGIS and GML-QL that demonstrates the usefulness of the language.
This study brings to light several key research issues that need to be addressed before the GML-QL language can be used in a WebGIS environment. The main issue is related to the management of XML data itself - native XML databases or object relational databases (ORDBMS). Given the complexities of implementing spatial predicates and spatial analysis functions on semistructured data, it seems logical to choose ORDBMS for GML data management. Also, the recent trend shows that spatial extensions of ORDBMS based on the SQL/OGIS standard are becoming commercially available and have been adopted in several organizations dealing with spatial data. The relational completeness of GML-QL and SQL/OGIS allows the mapping between these two technologies. However, further research is needed for efficient translation and mapping between GML/GML-QL and ORDBMS/SQL (for both data and query). Our future study will be aimed at solving some of these issues.
This research has been supported through cooperative agreement with NASA (NCC 5316) and by the University of Minnesota Agriculture Experiment Station project MIN-42-044. I would like to thank my fellow project associate Ajay B. Pandey for his continued help and technical inputs. Technical inputs and critical comments of Prof. Shashi Shekhar, Prof. Tom Burk and Prof. Jaideep Srivastava have greatly improved my understanding of XML Query languages and the overall quality of this paper. The comments of Kimberly Koffolt have greatly improved the readability of this paper.
In this section we briefly describe GML data types and provide a sample fragment of a GML document.
GML Data types
The GML is defined over the geometry model for simple features specification
[16]. This model has an abstract base
Geometry class and associates each geometry object with a spatial reference
system (SRS) that describes the coordinate space in which the object is
defined. The base Geometry class has subclasses for Point, Curve, Surface
and Geometry Collection. GML defines three base schemas for encoding spatial
information. They are: the geometry schema (geometry.xsd) which defines
the simple feature geometry model, the feature schema (feature.xsd) which
defines the general feature-property model, and the Xlink schema (xlinks.xsd)
which provides the XLink attributes used to implement linking functionality.
These schema documents alone do not provide a schema suitable for constraining
data instances; rather, they provide base types and structures which may
be used by an application schema. An application schema declares the actual
feature types and property types of interest for a particular domain, using
components from GML in standard ways. In accord with the OGC Simple Features
model, GML provides geometry elements corresponding to the following geometry
classes: Point, LineString, LinearRing, Polygon, MultiPoint, MultiLineString,
MultiPolygon, and MultiGeometry. In addition there are <coordinates>
and <coord> elements for encoding coordinates, and a <Box> element
for defining extents. An example of Point element is provided below:
<Point gid="P1" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<coord><X>56.1</X><Y>0.45</Y></coord>
</Point>
Each <Point> element encloses either a single <coord> element or a <coordinates> element containing exactly one coordinate tuple; the srsName attribute is optional since a Point element may be contained in other elements that specify a reference system. Similar considerations apply to the other geometry elements. The Point element, in common with other geometry types, also has an optional gid attribute that serves as an identifier. Further details can be found in [17].
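Because a Point may carry its position in either of the two encodings, any function that consumes GML geometry has to handle both. The Python sketch below shows one way to read a point, assuming un-namespaced element names as in the example above and a comma-separated tuple inside `<coordinates>`; the function name `point_xy` is hypothetical.

```python
import xml.etree.ElementTree as ET


def point_xy(point):
    """Return (x, y) from a <Point> using <coord> or <coordinates>."""
    coord = point.find("coord")
    if coord is not None:
        # <coord><X>..</X><Y>..</Y></coord> form
        return float(coord.findtext("X")), float(coord.findtext("Y"))
    # <coordinates>x,y</coordinates> form with exactly one tuple
    x, y = point.findtext("coordinates").strip().split(",")[:2]
    return float(x), float(y)


coord_form = ET.fromstring(
    '<Point gid="P1" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">'
    "<coord><X>56.1</X><Y>0.45</Y></coord></Point>"
)
coordinates_form = ET.fromstring(
    "<Point><coordinates>56.1,0.45</coordinates></Point>"
)

xy_a = point_xy(coord_form)
xy_b = point_xy(coordinates_form)
gid = coord_form.get("gid")               # optional identifier
srs = coordinates_form.get("srsName")     # None when inherited from a parent
print(xy_a, xy_b, gid, srs)
```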
Schema for Country, City and River dataset
The database consists of three entities: Country, City, and River.
The simplified schema is given below:
Country(Name: varchar(35), Shape: Polygon)
City(Name: varchar(35), Country: varchar(35), Pop: integer, Shape: Point)
River(Name: varchar(35), Shape: LineString)
We provide schema of City.gml below. The schema for Country and River
are similar and can be easily defined using the corresponding relational
schema.
<?xml version="1.0" encoding="UTF-8"?>
<!-- File: city.xsd -->
<schema targetNamespace="http://www.opengis.net/examples"
xmlns:ex="http://www.opengis.net/examples"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:gml="http://www.opengis.net/gml"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" version="2.01">
<sequence>
Sample GML document
Here we provide a fragment of City.gml document.
<city>
<xsd:Name>Havana</xsd:Name>
<xsd:Country>Cuba</xsd:Country>
<xsd:Pop>2.1</xsd:Pop>
<gml:Shape>
<gml:Point gid="P1" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<gml:coord><gml:X>101.8</gml:X><gml:Y>103.4</gml:Y></gml:coord>
</gml:Point>
</gml:Shape>
</city>