InfiniteGraph meets Stackoverflow

A few months ago I worked on a demo application for a client that was written in entirely in C++. I ended up running into an issue around Threads and since I hadn’t used the language in a while, I turned to the internet for help.  Many of my search results pointed at pages in a website called Stackoverflow.   For those of you that haven’t heard about it yet, it’s basically a community Q&A website devoted to and collaboratively maintained by programmers for programmers.  The site describes itself as a synthesis of popular information sharing taking the best of Wikis, Blogs, Forums and Digg/Reddit.  Not only did I find the specific answer to my question, but I also got to review answers and comments from others with related expertise and insight.    So I checked out the rest of the site and discovered a very healthy community of members along with a well organized and fast UI.   I promptly bookmarked it for future use.

So why do I mention this in a Graph Database blog?   Well recently I was looking for sample data sets for a demo I was building for our newly-launched InfiniteGraph product.   But I wasn’t all that interested in building another Facebook or Twitter graph example as I thought were a bit played out.  So I ventured back to Stackoverflow and after searching a bit, found a torrent pointing to a full dataset for the site (in XML format).  In all, the data set includes exports from the entire family of Stackoverflow sites including Stackoverflow,  Superuser,  Serverfault and Meta Stackoverflow.  Surprisingly, the data model is quite simple and represented completely by the following object types:

  • Users
  • Posts (marked as either Questions or Answers)
  • Comments
  • Votes
  • Badges

The interaction among the objects are:

  • Users can post either Questions or Answers
  • There can be multiple Answers for a Question; only one of which is marked by the owner of the Question as the “accepted” answer.
  • Users can provide Comments for both Questions and Answers
  • Questions and Answers can receive Votes by the community for their relevance/appropriateness/applicability and Questions can be marked as personal favorites
  • Badges are tags awarded to Users for their level of participation and contribution to the community and directly affect their reputation.  The higher the reputation the more capabilities awarded to the individual to make changes to the site

The backend datastore for the site appears to be RDMBS and confirmed by the following set of public queries allowing visitors to execute either prebuilt or custom SQL queries against the DB.

My graph model for the same set of data appears as follows: (sans the Votes and Badges objects which I chose to deliberately ignore for the purposes of my sample)

The vertices in the graph are represented as the Users, Questions and Answers above while the edges are represented as the interactions between them (i.e. a User “Posts” a Question,  an Answer is “For” a Question, a User “Comments On” a Question or Answer).  Simple enough, and like most other social graphs, users seem to be the focal points with the majority of connected edges.   Now all I needed was a sample application that could construct the graph data model from the XML sources and run some queries.

Not being satisfied with creating the usual command line interface application, I put my back-pocket UI skills to work and wrote Java-based application to visually demonstrate the results.   For those that care, the sample was constructed as an Eclipse RCP application using the Zest Eclipse visualization toolkit to draw the graph elements.   Sure, there are lots of open-source and commercial visualization toolkits that can accomplish the same thing, if not better.  But I wanted something that integrated nicely with Java in Eclipse and didn’t require an intermediate data format such as XML or JSON to display the graph.

The import process was fairly straightforward as I used a StAX parser to read the XML data into the application and construct the appropriate vertex and edges.  Once data was in the system, I took to the task of coming up with a few queries I thought might be interesting to run.  One of the most powerful ways to handle queries within InfiniteGraph is to construct a Navigator.  A Navigator allows you to run algorithms over the graph starting from a known vertex and provides a way to qualify both the paths explored as well as the results returned.   The last argument in the navigator call is a reference to a NavigationResultHandler which, as the name implies, allows you to take the qualified paths and consume them for display or other purposes.  The method signature is specified as follows:

Navigator navigate(Guide guide,
                   Qualifier pathQualifier,
                   Qualifier resultQualifier,
                   NavigationResultHandler resultHandler)

Initiates a navigation across the graph, starting from this vertex and defined by the rules provided.

Parameters:
guide - the guide directs the navigation, determining both which edges are to be followed and the order in which they are crossed.
pathQualifier - qualifier which determines if a path being followed by the navigator is valid or should be terminated.
resultQualifier - qualifier which determines if a path that has been followed constitutes a result of this navigation. The returned iterator will present all paths that are matched by this qualifier
resultHandler - a handler for receiving result paths (those matched by the supplied resultQualifier). When the navigator finds a result path, it will invoke the handleResultPath method on this reference. The reference must not be null.
Returns:
a Navigator instance which can be used to start the navigation process

My first query tried to answer the following question:  Show me all the questions on the site that have been answered by a specified user given the following conditions. First the question should have a minimum favorite count (i.e. the question was marked as a favorite question by at least N users).  Second, the answer to that question should be the “accepted” answer as marked by the question owner.    One may take this as a way to find users with a fair amount of credibility on a website or within a support organization by seeing how they answered the most popular questions.

So I constructed my navigator code as follows:


   public void run()
   {
      Transaction tx = null;
      try
      {
         // Start a transaction
         tx = getGraphDB().beginTransaction(AccessMode.READ);
         ...
         Navigator nav = startVertex.navigate(new CustomGuide(), new PathQualifier(), new ResultQualifier(),
               getResultHandler());

         // Start the navigation
         nav.start();
         tx.commit();
      }
      catch(StorageException e)
      {
         e.printStackTrace();
      }
      finally
      {
         tx.complete();
      }
   }

   class CustomGuide implements Guide
   {
      @Override
      public Strategy getHopOrder(VertexHandle currentVertex, List<Hop> order)
      {
         for (ListIterator<Hop> iter = order.listIterator(order.size()); iter.hasPrevious();)
         {
            Hop h = (Hop) iter.previous();
            if (h.getEdgeHandle().getTypeId() == commentsOnEdgeTypeId)
               iter.remove();
         }

         // Run a deep search on the priority branches first
         return Guide.Strategy.DEPTH_FIRST;
      }
   };

   class PathQualifier implements Qualifier
   {
      @Override
      public boolean qualify(Path currentPath)
      {
         return ((currentPath.size() <= 3) ? true : false);
      }
   };

   class ResultQualifier implements Qualifier
   {
      @Override
      public boolean qualify(Path currentPath)
      {
         if (currentPath.size() == 3)
         {
            Hop h = currentPath.getFinalHop();

            if ((h.getVertexHandle().getTypeId() == questionVertexTypeId))
            {
               Question question = (Question) h.getVertex();
               AnswerFor answerFor = (AnswerFor) h.getEdge();
               if ((question.getFavoriteCount() >= favQuestionCount) && answerFor.isAcceptedAnswer())
                  return true;
            }
         }
         return false;
      }
   };

A few notes about the code.

  • In the CustomGuide class, I remove any paths that start with the edge type of Comments On since I’m not interested in following them.  For large graphs, this will improve performance of the search by pruning the set of paths the navigator must follow.
  • For the PathQualifier class, I’m telling the navigator to continue searching for paths (hops) 3 or less deep.  Anything over that is not subject to evaluation.
  • The code in the ResultsQualifier does all the query criteria checking, again only looking at results that are at most 3 hops in size (e.g.  User -> Answer -> Question)

Running this query gave me the following graph in my application:

While the graph looked pretty, I wasn’t really excited about the query itself.  So I tried again, this time coming up with what I thought was a more interesting set of search criteria.  This time I wanted to answer the following question:  Show me all the questions answered by the given user which were tagged with certain keywords so I can find out if that user has a strong background in those topics.  This might be useful for IT organizations looking for staff which have skills in a certain area and match them with internal projects.  The number of answers returned would indicate their familiarity with the subject and thus categorize their experience level, ultimately driving their demand.

My navigator code resembled the following:


   public void run()
   {
      Transaction tx = null;
      try
      {
         // Start a transaction
         tx = getGraphDB().beginTransaction(AccessMode.READ);
         ... 
         Navigator nav = startVertex.navigate(new CustomGuide(), new PathQualifier(), new ResultQualifier(),
               getResultHandler());

         // Start the navigation
         nav.start();

      }
      catch(StorageException e)
      {
         e.printStackTrace();
      }
      finally
      {
         tx.complete();
      }
   }

   class CustomGuide implements Guide
   {
      @Override
      public Strategy getHopOrder(VertexHandle currentVertex, List<Hop> order)
      {
         for (ListIterator<Hop> iter = order.listIterator(order.size()); iter.hasPrevious();)
         {
            Hop h = (Hop) iter.previous();
            if (h.getEdgeHandle().getTypeId() == commentsOnEdgeTypeId)
               iter.remove();
         }

         // Run a deep search on the priority branches first
         return Guide.Strategy.DEPTH_FIRST;
      }
   };

   class PathQualifier implements Qualifier
   {
      @Override
      public boolean qualify(Path currentPath)
      {
         return ((currentPath.size() <= 3) ? true : false);
      }
   };

   class ResultQualifier implements Qualifier
   {
      @Override
      public boolean qualify(Path currentPath)
      {
         if (currentPath.size() == 3)
         {
            Hop h = currentPath.getFinalHop();

            if ((h.getVertexHandle().getTypeId() == questionVertexTypeId))
            {
               Question question = (Question) h.getVertex();
               String tags = question.getTags();
               if (tags.matches(skills))
                  return true;
            }
         }

         return false;
      }
   };

A few notes about the code.

  • In the CustomGuide class, I again remove any paths that start with the edge type of Comments On since I’m not interested in following them.
  • For the PathQualifier class, I’m telling the navigator to continue searching for paths (hops) 3 or less deep.  Anything over that is not subject to evaluation.
  • In the ResultsQualifier class I evaluate the tags properties using the regex string supplied in the UI.   In the original RDMBS the tags are stored in their own table.  But in the XML data dump from the site, the tags are stored as a single tokenized string as follows: <tag1><tag2><tag3>…<tagN>.

Running this query gave me the following qraph in my application:

For my last query, I decided to have a little fun and try a Six Degrees of Separation type search against the DB. The idea is to pick any two users on the site and find out what they have in common, exploring any path less than the specified degrees of separation (6 max in this sample). The user could also specify how many results to return as well. The following is my navigator code:


   public void run()
   {
      Transaction tx = null;
      try
      {
         // Start a transaction
         tx = getGraphDB().beginTransaction(AccessMode.READ);
         ...

         nav = startVertex.navigate(Guide.SIMPLE_DEPTH_FIRST, new PathQualifier(), new ResultQualifier(),
               getResultHandler());

         // Start the navigation
         nav.start();

      }
      catch(StorageException e)
      {
         e.printStackTrace();
      }
      finally
      {
         tx.complete();
      }
   }

   class PathQualifier implements Qualifier
   {
      @Override
      public boolean qualify(Path currentPath)
      {
         return ((currentPath.size() <= maxDegrees) ? true : false);
      }
   };

   class ResultQualifier implements Qualifier
   {
      int count = 0;

      @Override
      public boolean qualify(Path currentPath)
      {
         Hop h = currentPath.getFinalHop();

         if ((h.getVertexHandle().getTypeId() == userVertexTypeId))
         {
            if (((User) h.getVertex()).getUserId() == userId2)
            {
               if (count++ == maxPaths)
                  nav.stop();
               return true;
            }
         }

         return false;
      }
   } 

A few notes about the code.  

  • I don’t create a CustomGuide class this time. Instead, I use the built-in Depth-First search guide.
  • For the PathQualifier class, I’m telling the navigator to continue searching any paths (hops) less than or equal to the user-specified max degrees of separation
  • In the ResultsQualifier class I check the final hop in the path and if it’s a user, we’ve found a possible path. The last check makes sure we only return at most N results by an explicitly stopping the navigator

Running this query gave me the following graph in my application:

That’s it for now. Check in from time to time on this blog as I update the sample app with more data and new queries. I might even try some of the pre-built ones up on the StackExchange Data site and compare the performance results.

Post a Comment

Required fields are marked *
*
*

%d bloggers like this: