Tukwila Overview

Copyright (C) 2002-2003 by Zachary G. Ives
Original version 9/2/2002 - last updated 2 /19/2003

Use of Tukwila

Windows Execution Engine

Go to the Tukwila directory and run Tukwila.exe.  All DLLs should be in the same directory.

GUI

The GUI can be run with:

./TukwilaGUI [{hostname}]

where {hostname} defaults to localhost and is the name of the machine running the execution engine.

To test your connection, open "Maya1.xq" or "GS.xq" and run it.

Web Status Monitoring

Go to http://{hostname}:7777 to see the system status.

Code Overview

Execution Engine Core

Memory/Page Management

The memory management system is divided into several components.  The core component is CMemoryManager, which is provided with a pool size (currently specified in tukwila_config.xml) and grants memory to the other managers in terms of extents, via the requestExtent() and freeExtent() calls.

Extents are used by the following components:

CBufferManager, which creates the buffer pool and grants portions of the pool to implementors of the IRelation interface, currently descendants of CRelationBase:
  • CXMLRelation is responsible for buffering and retrieving XML content.  It encodes this content in compressed form.  The compression is handled by internal hash tables and a tokenizer.  Note that these structures aren't currently persistent, nor is the CXMLRelation itself.
  • CFlatRelation is responsible for buffering and retrieving standard tuples.  In general, this is how we 
  • CMappedFile is a memory-mapped file, useful for nested loops joins and such.  It adds a variety of additional functions to CFlatRelation, for instance the ability to search based on a key.  These are basically equivalent to the features supported by CHashTable.

RelationBase's descendants are assumed to have a single reader and a single writer.  This could be changed by switching from CMutex to CSem, but CSem is currently untested.

Internally, there is a RelationManager that creates and tracks mappings between relation IDs and paths.

CHashTable, which directly allocates a hash table from the memory manager and allows explicit control of buckets.  There are functions to find tuples, replace them, etc.

Hash tables are not thread-safe; the assumption is that all reads and writes will be synchronous.

CPageManager, which requests a group of pages from the buffer manager and allows for indirect access to them.  Basically, the role of the PageManager is to reserve a fixed portion of the buffer pool for use by an operator (instead of relying on the default LRU replacement of the pool).

Typically, CPageManager itself isn't directly used -- instead, its descendant, CTupleManager, is.  CTupleManager allows us to directly read and write tuples to/from pages that were allocated in a block.

A particular user of CTupleManager is the external sort, implemented in CTupleSorter.

Data values

IOperand is the basic entity of manipulation, and it can be a computed expression or a primitive data value.  Both algebraic operators and data values should subclass from this interface.  Operands can be queried as to their return type, null-ness, and so on.

Note that null-ness is not preserved by an entity when it is written (to disk or to memory).  Instead, for efficiency reasons the tuple container should record and maintain null-ness of its attributes.  (This way each null only takes 1 bit of storage.)

IValue is the basic data value, and it has a sub-interface template called IPrimitiveType<T> that is used for primitive scalar types.

Specific classes include:

  • CInteger
  • CDouble
  • CBool
  • CString
  • CXML

Values generally implement:

  • read() and write() to memory and to files
  • assign() of values
  • get...Value() and get...ValueAt() to read and return values (optionally at pointers)
  • cmp() and cmpAt() to compare with values (optionally at pointers)
CTuple is the basic tuple.  It is currently not a subclass of IValue, but this could easily be changed.  The big question is what operators would return tuples (e.g., instead of relations), and how scalar-based operators would handle things when given a tuple.

Tuples are responsible for knowing which of their attributes are null, and they store and read this from memory.  They call the appropriate attribute containers to make sure nullness is carried down to the attribute, and read from it. 

Operators

IOperator is a child of IOperand, and thus every operator can return an operand that can be used in an expression (and similarly, every IValue can be an operand).  There are three "interface"-style classes that define the arity of an operator:
  • INullary
  • IUnary
  • IBinary

Additionally, there is an IBoolean interface for classes that return booleans (namely, boolean operators and comparisons).

  • Operators must implement computeTypeInfo(), which determines their return types and creates any internal data structures.  This function is given the schema of the tuple upon which it can operate.
  • Operators must implement rebind(), which is given a tuple that can supply values to variables.

Subclasses of IOperator include:

  • IBoolean
    • CUnaryBoolean
    • CBinaryBoolean
    • CComparison
  • CUnaryArithmetic
  • CBinaryArithmetic
  • CNamedVariable, which is a reference to a value within a tuple
IQueryOperator is the interface for all query operators.  Typically operators actually subclass one if its descendants, which are named according to the number of child operators:
  • ILeafOperator has no child query operators and is typically implemented directly.  Instances include tablescans, tuple producers, and x-scans.
  • IOneChildOperator has descendants:
    • IProjectingOperator affects the arity of the child operator's tuple (e.g., projection)
      • IAttributeAddingOperator adds precisely one attribute to the child's tuple (e.g., XML tagging)
    • IPassThroughOperator does not modify the child operator's arity (e.g., selection)
  • ITwoChildOperator is a binary query operator, e.g., a union or a join.

IQueryOperator is not currently a subclass of IOperator because it isn't clear that we want to interchange between tables and scalars.  However, it is possible to extend in this way (and in fact, there are datatypes such as GROUP that are reserved for such a feature).

The (standard) order of iterator execution goes:

  • open()
    • opSpecificOpen() does operator-specific initialization before metadata exists.
      It sets _initFirstTuple if setNewTuple() should be called on the first tuple (as opposed to having this called by opSpecificNext()).
    • createMetadata() determines the output metadata based on the child or any initialization
    • bindSyms() computes any necessary bindings (e.g. between select exprs and the schema)
  • next()
    • if complete then exit
    • if new plan then set new child
    • set _firstTuple
    • if _initFirstTuple then
      • setNewTuple() is called here
    • opSpecificNext() which should call setNewTuple() as necessary, e.g., on firstTuple()
    • setNewTuple() should call rebind() on any predicates
  • close() should deallocate any tuples created by the operator 

Tuple creation and deletion is done by the parent operator, after it has read schema information from the child.  setNewTuple() establishes the necessary bindings between the output tuple and the child tuple.

Current query operators include:

  • NSelect
  • NProject
  • NJoin (using any of several TableAccessors and JoinConsumers)
  • NMergeJoin
  • NStatsCollector
  • NGroup
  • NPseudoGroup
  • NRemap
  • NRootOp
  • NWebJoin
  • NIdentity
  • NDematerialize
  • NMaterialize
  • NMateralizePoint
  • NXAttribute
  • NXElement
  • NXLiteral
  • NXLookup
  • NXOutput
  • NXScan

Internal Query Optimizer/Re-Optimizer

The optimizer works on PlanNodes, which are higher-level representations than standard query operators.  In particular, a PlanNode encapsulates the following operations:

The main class is OptimizePlan.  The actual optimizer is split among ExecPlan.cpp (actual operator generation) and Optimizer.cpp (PlanNode optimization).  Each re-optimization creates a new dynamic programming enumeration; after optimization, only the actual nodes in the final plan are preserved.

Important methods in OptimizePlan are:

External Interface

AQPDaemon.cpp contains the server daemon (which now handles web requests, AQP/CQP requests, and also XML query plans).

Currently the daemon resides by default on port 7777.  Queries are triggered via SOAP over HTTP.  Status information can be obtained simply via HTTP requests (GET / will return the root status page).  Performance monitoring can be done over executing queries; by default, the dot package is used to obtain graph-formatted query plans in GIF or PNG.

Query.cpp is the main routine and includes keyboard I/O control (which will probably be deprecated and replaced by web forms for reconfiguring the system)

Pre-Optimizer

The pre-optimizer is a Java component that is mostly responsible for parsing and language-level rewrites.  It is not 100% compliant with the current XQuery specs, and in fact it does not even support multiple lexing modes (so some words are reserved).

The rewrites of the optimizer mostly involve the following: