axiom//cloudlog.core - Rule semantics

Author: Temporarily Removed (temporarily@removed.com)

Date: 25 April 2018

Repository: https://github.com/temporarily/removed

Version: 0.4.1

1 defrule: Rule Definition Macro

1.1 Simple Rules

defrule defines a Cloudlog rule. Such a rule always starts with a fact pattern: a vector for which the first element is a keyword representing the fact name, and the rest of the elements are bindings, as we explain later.

The following rule – foo-yx, matches facts of the form [:test/foo x y] (facts named :test/foo with two arguments we call x and y), and for each such fact it creates a new fact of the form [foo-yz y x].

(defrule foo-yx [y x]
  [:test/foo x y] (by-anyone))

The (by-anyone) guard means that we do not care on behalf of whom this fact was published. Typically, we will care about this, but we delay discussion on this to our discussion of data integrity.

What defrule actually does is define a Clojure function that, when given the arguments for the source fact (in our case, :test/foo), it returns a sequence of bindings for the target fact (foo-yx).

(foo-yx [1 2]) => [[2 1]]

The function contains metadata regarding the identity of the source fact.

(-> foo-yx meta :source-fact) => [:test/foo 2]

The name and arity of the target fact is also metadata of the function.

(-> foo-yx meta :target-fact) => [::foo-yx 2]

The arguments of both the rule and the source fact are not limited to being variables. They can also be values.

When a fact is introduced to a rule, it is unified with source-fact part of the rule's body. Variables in the source-fact are bound to the corresponding values in the input fact, and values are compared. If the values differ, the rule is not applied to this fact.

For example, the following rule is similar to foo-yx, only that it assumes that x == 1.

(defrule foo-unify [x 1]
  [:test/foo 1 x] (by-anyone))

For [1 2], we will get the same result as before:

(foo-unify [1 2]) => [[2 1]]

But if we introduce a fact in which x != 1, the rule will not be applied, and the result sequence will be empty.

(foo-unify [2 3]) => empty?

1.2 Guards

While simple rules are useful in some cases, they are limited to reordering or restructuring the fields in the source fact, but cannot do more. Guards fix this by allowing (purely-functional) Clojure functions to be used inside rules.

Guards are Clojure forms such as let, for or when. The example below uses let to create a new binding (variable z), calculated as the sum of x and y:

(defrule foo-let [z]
     [:test/foo x y] (by-anyone)
     (let [z (+ x y)]))

The result is as you would expect.

(foo-let [1 2]) => [[3]]

Below is a more elaborate example. Here we index text documents by:

Extracting all the words from a document, and iterating over them using a for guard
Converting each word to lower-case (so that indexing becomes case-insensitive) using a let guard, and
Filterring out "stopwords", using a when-not guard

(def stop-words #{"a" "is" "to" "the"})
(defrule index-docs [word id]
     [:test/doc id text] (by-anyone)
     (for [word (clojure.string/split text #"[,!.? ]+")])
     (let [word (clojure.string/lower-case word)])
     (when-not (contains? stop-words word)))

Now, if we index a document, we will get index entries with the words it contains, lower-case, excluding stopwords.

(index-docs [1234 "Hello, to the  worlD!"]) => [["hello" 1234] ["world" 1234]]

Cloudlog guards differ from the corresponding Clojure forms in that they do not have a body. In the above code, the for form ends after the bindings have been established, and the same goes for the let and when-not forms. A corresponding Clojure implementation could look like this:

(for [word (clojure.string/split text #"[,!.? ]+")]
  (let [word (clojure.string/lower-case word)]
    (when-not (contains? stop-words word)
      (emit some result))))

with each form containing the following forms. However, Cloudlog is a logic programming language, like Prolog or core.logic. Cloudlog guards are just like predicates. Bindings in let and for forms assert a certain relationship between the bound variable and the expression to its right. A when or when-not guards are just like predicates that pass or fail depending on the (Clojure-level) predicate given to them.

1.3 Joins

Even with guards, rules are still limited to considering only a single fact. Sometimes we need to draw a conclusion based on a combination of facts. A classical example is applications such as Twitter, in which users can:

Follow other users,
Post tweets,
View their timelines, consisting of all the tweets made by users they follow.

To successfully generate a timeline, a rule needs to take into consideration both who follows whom, and tweets – two different kinds of facts. Moreover, there is a data dependency between the two. We are only interested in tweets made by users we follow.

Cloudlog rules can depend on more than just the source-fact.

(defrule timeline [user tweet]
     [:test/follows user author] (by-anyone)
     [:test/tweeted author tweet] (by-anyone))

In such cases, the rule function cannot produce the result right away. The above rule's source fact is :test/follows:

(-> timeline meta :source-fact) => [:test/follows 2]

However, from a :test/follows fact alone we cannot create a timeline entry. To create such an entry, we need to match it with a :test/tweeted fact.

To allow this, functions that represent rules that depend on more than one fact have continuations.

Continuations are functions, provided as metadata on the rule function.

(-> timeline meta :continuation) => fn?

The continuation function itself has metadata, indicating what its source-fact is.

(-> timeline meta :continuation meta :source-fact) => [:test/tweeted 2]

As in the case of simple rules, in case of a join, the rule function also returns a sequence of tuples, only that this time these tuples are not results, but rather continuations. Each tuple contains the information that the continuation function needs in order to resume the rule.

For example, the timeline rule above will emit the information it learned from the :test/follows fact it encountered.

(timeline ["alice" "bob"]) ; Alice follows Bob
 => [["bob" "alice" "bob"]]

Notice that "bob" appears twice in the tuple. Its second appearance is as the value for variable author. Its first appearance is as the key for the :test/tweeted fact. We'll discuss keys in more detail below.

This tuple can be used to construct a new rule function based on the continuation.

(let [cont (-> timeline meta :continuation) ; The continuation function we got as meta
     rule-func (cont ["bob" "alice" "bob"])] ; The new rule function
 (rule-func ["bob" "Hi, Alice!"]) ; :test/tweeted fact
 ) => [["alice" "Hi, Alice!"]]

Cloudlog tries to be true to its logic-programming nature, but since it is intended to work with large amounts of data, some restrictions need to be applied. In our case, the main restriction is that in any fact, the first argument is considered the key, and there are some restrictions and recommendations regarding keys. Generally, choosing the correct key has a significant impact on the performance of the application. A key must be specific enough so that all facts with the same key can be stored in memory at the same time.

When writing rules with joins, we need to make sure the key parameter for the joined fact is bound. For example, in the timeline rule, we chose the order of facts for a reason. :test/tweeted is keyed by author, and :test/follows is keyed by user. If we get the :test/follows first, we learn about the author who's tweets we need to consider. However, when we consider :test/tweeted first, this does not give us a clue regarding the :test/follows facts we need to consider for this tweet, since it does not provide value for user.

We provide a compile-time error in such cases.

(macroexpand `(defrule timeline [user tweet]
               [:test/tweeted author tweet] (by-anyone)
               [:test/follows user author] (by-anyone)))
 => (throws "variables #{cloudlog.core_test/user} are unbound in the key for :test/follows")

Of-course, guards are supported with joins.

(defrule foobar-range [w]
     [:test/foo x y] (by-anyone)
     (let [z (+ x y)])
     (for [w (range z)])
     [:test/bar z w] (by-anyone))

In the above rule we take x and y from fact :test/foo, sum them to get z, and span the range 0..z. We return w values for which there is a fact [:test/bar z w].

The rule function for foobar-range will return a tuple based on the guards (and not based on :test/bar, which is to be considered afterwards). The first element in each tuple is a key to be used against :test/bar (z), followed by the values of w and z:

(foobar-range [1 2]) => [[3 0 3] [3 1 3] [3 2 3]]

If a matching :test/bar fact exists, a result will be produced.

(let [cont (-> foobar-range meta :continuation)
     rule-func (cont [3 1 3])]
 (rule-func [3 1])) => [[1]]

However, if it does not, an empty result will be produced.

(let [cont (-> foobar-range meta :continuation)
     rule-func (cont [3 1 3])]
 (rule-func [4 1])) => []

1.4 Derived Facts

Each rule defines a derived fact, i.e., each tuple produced by a rule is stored as a fact. The name of this fact is the fully-quaified name of the rule function. This fact can then be used in other rules.

For example, if we wish to create a "trending" timeline, aggregating the timelines of users identified as "influencers", we would probably write a rule of the following form:

(defrule trending [tweet]
  [:test/influencer influencer] (by-anyone)
  [timeline influencer tweet])

Now we can simulate our rule (using simulate-with):

(simulate-with trending "cloudlog.core_test"
              [:test/influencer "gina"]
              [:test/influencer "tina"]
              (f [::timeline "tina" "purple is the new black!"] :writers #{"cloudlog.core_test"})
              (f [::timeline "gina" "pink is the new purple!"] :writers #{"cloudlog.core_test"})
              (f [::timeline "some-lamo" "orange is the new bananna"] :writers #{"cloudlog.core_test"}))
 => #{["purple is the new black!"]
    ["pink is the new purple!"]}

1.5 Integrity

Integrity of information refers to protecting information from being modified by unauthorized parties. (Confidentiality, Integrity, Availability: The Three Components of the CIA Triad)

Cloudlog takes a liberal approach to integrity. Any user can publish any fact, as long as he or she is a member of the fact's writers set. We explain writer sets in our discussion of events, but for now we can just say the writer set is a piece of information telling us to whom a certain fact is attributed to. Any user within the fact's writer set can, for example, delete it and replace it with another.

When a rule is applied to a fact, it takes ownership over the result (see events). This makes the rule responsible for the integrity of the result.

To allow rules to specify integrity requirements we introduce the by guard. Look at the timeline rule defined above. The :test/tweeted fact indicates the user who tweeted. However, there is no guarantee that the user who appears in the fact as the author of the tweet is indeed the user who created that fact. For example, consider the fact:

[:test/tweeted "alice" "I hate Bob!"]

What guarantee do we have that it is Alice who created this fact, and not Eve, who's been trying to break up Alice and Bob for years?

The answer is we don't actually know that, because the Cloudlog is liberal about integrity, and allows any user (including Eve) to create any fact (including tweets by Alice).

So what do we do? Do we allow Eve to succeed in her evil plan? No, we do not. This is where the by guard comes to save the day (and Alice's love life). Below is a secure version of the timeline rule, that only takes into account tweets made by the advertized user.

(defrule secure-timeline [user tweet]
  [:test/follows user author] (by-anyone)
  [:test/tweeted author tweet] (by [:user= author]))

Now, if both Eve and Alice create tweets alegedly made by Alice, Bob (who follows Alice) will only see the ones genuinly made by Alice.

(simulate-with secure-timeline
              (f [:test/follows "bob" "alice"] :writers #{[:user= "bob"]})
              (f [:test/tweeted "alice" "Alice loves Bob"] :writers #{[:user= "alice"]})
              (f [:test/tweeted "alice" "Alice hates Bob"] :writers #{[:user= "eve"]}))
 => #{["bob" "Alice loves Bob"]}

Now we come back to by-anyone and why we had to use it all over the place. If we do not use a by guard on one of the source facts in a rule we put our application at risk of having unauthorized updates take effect. For example, in the secure-timeline rule above we checked the integrity of tweets, but "forgot" to check the integrity of following relationships. This mistake can help Eve get messages through to Bob although he does not follow her:

(simulate-with secure-timeline
              (f [:test/follows "bob" "eve"] :writers #{[:user= "eve"]})
              (f [:test/tweeted "eve" "Alice hates Bob"] :writers #{[:user= "eve"]}))
 => #{["bob" "Alice hates Bob"]}

Both facts were submitted (legally) by Eve. The only flaw was in our logic – we did not protect against this. The by-anyone guard is here to set a warning that we are doing something wrong. Typically we never want to use it, unless we have a really good reason to.

But what if we don't use any guard whatsoever? Cloudlog will not allow us to do this. For example, if we forget to place a by* guard on foo-yx, we get the following error:

(macroexpand '(defrule foo-yx [y x]
               [:test/foo x y]))
 => (throws "Rule is insecure. :test/foo is not checked.")

The same goes for rules with joins:

(macroexpand '(defrule secure-timeline [user tweet]
               [:test/follows user author] (by-anyone)
               [:test/tweeted author tweet]))
 => (throws "Rule is insecure. :test/tweeted is not checked.")

One place where by* guards are not needed is derived facts. Derived facts are assumed to be written by a rule, for which the namespace (a permacode version) provides a unique identifier. Since permacode uses a strong cryptographic hash for versioning, we can trust this hash to identify the rule. Placing a malicious rule under a fake identity of one of the application's rules requires a preimage attack over the hash key. This means that the attacker would need to generate the malicious rule in such a way that it will have the same hash code as the innocent one. This is known to be infeasible for the SHA256, the algorithm we use.

Consider the trending rule we defined above. When it references timeline it doesn't use a by* guard. However, it is still guarded against malicious postings of timeline facts, and would only consider ones that are made by cloudlog.core_test – timeline's namespace.

In the following example, "orange is the new bananna" is a fake timeline entry made by user some-lamo presumably on behalf of tina. As you can see, Cloudlog blocks it from appearing as a trending result.

(simulate-with trending
              [:test/influencer "gina"]
              [:test/influencer "tina"]
              (f [::timeline "tina" "purple is the new black!"] :writers #{"cloudlog.core_test"})
              (f [::timeline "gina" "pink is the new purple!"] :writers #{"cloudlog.core_test"})
              (f [::timeline "tina" "orange is the new bananna"] :writers #{[:user= "some-lamo"]}))
 => #{["purple is the new black!"]
    ["pink is the new purple!"]}

2 defclause: Top-Down Logic

Regular rules defined using defrule define bottom-up logic. Bottom-up logic is applied when facts are added or removed, creating or removing derived facts as needed. Unfortunately, this is often not enough. One limitation of bottom-up reasoning (logic) is that it cannot take into account a goal – a clue for what we are trying to find. It tries to build all possible result tuples, but without guidance this can be hard at times. Clauses fix this by performing logic at query time, starting with a provided goal.

Consider the index-docs example above. When a new :test/doc fact is created, the index-docs rule creates an index entry for each keyword in the text. Now imagine we wish to use this index for retrieval. What we want is to allow users to provide a sequence of search keywords, and get the full text of all the document that match all these keywords.

Doing this with bottom-up reasoning is quite hard. We need to create index elements for every combination of keywords. Clauses allow us to start with a particular combination, retrieve documents based on one element (say, the first keyword) retrieve full texts and only accept those in which the other keywords appear.

The following clause does just that.

(defclause multi-keyword-search
  [:test/multi-keyword-search keywords -> text]
  (let [keywords (map clojure.string/lower-case keywords)
        first-kw (first keywords)])
  [index-docs first-kw id]
  [:test/doc id text] (by-anyone)
  (let [lc-text (clojure.string/lower-case text)])
  (when (every? #(clojure.string/includes? lc-text %)
                (rest keywords))))

multi-keyword-search is the name of the function to be created (similar to a rule function). :test/multi-keyword-search is the predicate to be associated with this clause. Unlike the function name which needs to be unique, the predicate can be shared across clauses. Users will eventually query predicates, not clauses. The vector [keywords] contains the input arguments to the predicate, and [text] is the vector of output arguments. In either case you can have more than one argument. Following these headers are the body element of the clause.

The source-fact for a clause is a question of the form :predicate-keyword?. For example:

(-> multi-keyword-search meta :source-fact) => [:test/multi-keyword-search? 2]

Here, :test/multi-keyword-search? is a question, a fact containing the input arguments, preceded by a $unique$ parameter, which identifies a specific question (hence the arity 2).

The answer is a derived fact of the form :predicate-keyword!:

(loop [rulefunc multi-keyword-search]
 (or (-> rulefunc meta :target-fact)
     (recur (-> rulefunc meta :continuation))))
 => [:test/multi-keyword-search! 2]

In this case, the answer's arity is also 2 – the unique identifier that matches the question and the output paramter.

(simulate-with multi-keyword-search
                 [:test/multi-keyword-search? 1234 ["hello" "world"]]
                 [:test/multi-keyword-search? 2345 ["foo" "bar"]]
                 (f [::index-docs "foo" "doc1"] :writers #{"cloudlog.core_test"})
                 (f [::index-docs "foo" "doc2"] :writers #{"cloudlog.core_test"})
                 (f [::index-docs "hello" "doc5"] :writers #{"cloudlog.core_test"})
                 [:test/doc "doc1" "Foo goes into a Bar..."]
                 [:test/doc "doc2" "Foo goes into a Pub..."]
                 [:test/doc "doc5" "World peace starts with a small Hello!"])
  => #{[1234 "World peace starts with a small Hello!"]
       [2345 "Foo goes into a Bar..."]}

3 f: A Convenience Function to Create Facts

In the following we will use the fact function, which allows us to create facts.

(let [f (f [:test/follows "alice" "bob"] :writers #{[:user= "alice"]} :readers #{"foobar"})]
 f => [:test/follows "alice" "bob"]
 (meta f) => {:writers #{[:user= "alice"]} :readers #{"foobar"}})

4 simulate-with: Evaluate a Rule based on facts

We believe in TDD as a "way of life" in the software world. The examples in this very document are tests that are written before the corresponding implementation. Regardless of whether we write the tests before or after the implementation, it is well agreed that automated tests are key to successful software projects. App developers using Cloudlog should be no exception.

We therefore provide the simulate-with function, which allows rules to be tested independently of the system implementing Cloudlog, without having to load data to databases, launch clusters etc.

4.1 simulate-with

The simulate-with function accepts the following arguments:

A single rule function to be simulated,
Zero or more facts, making up the test environmnet for the simulation.It returns a set of tuples, representing the different values the rule gets given the facts.For example:

(simulate-with timeline
              [:test/follows "alice" "bob"]
              [:test/follows "alice" "charlie"]
              [:test/tweeted "bob" "hello"]
              [:test/tweeted "charlie" "hi"]
              [:test/tweeted "david" "boo"])
 => #{["alice" "hello"]
    ["alice" "hi"]}

Resulting tuples are given a writer-set containing the given group identifier.

(-> (simulate-with timeline
                  [:test/follows "alice" "bob"]
                  [:test/tweeted "bob" "hello"])
   first meta :writers) => #{"cloudlog.core_test"}

The simulation calculates the reader-set for the result. Generally, this is the intersection of the reader sets of all participating facts.

(let [X #{:a :b}
     Y #{:b :c}]
 (-> (simulate-with timeline
                    (f [:test/follows "alice" "bob"] :readers X)
                    (f [:test/tweeted "bob" "hello"] :readers Y))
     first meta :readers) => (interset/intersection X Y))

4.2 Under the Hood

simulate-with is merely a combination of two lower-level functions: simulate* and with*:

4.2.1 with*

The with is replaced with a call to the with* function, which translates a sequence of facts to a map from fact names and arities to sets of value tuples. For example:

(with* [[:test/follows "alice" "bob"]
       [:test/tweeted "bob" "hello"]
       [:test/follows "bob" "charlie"]])
 => {[:test/follows 2] #{["alice" "bob"]
                       ["bob" "charlie"]}
   [:test/tweeted 2] #{["bob" "hello"]}}

with* moves metadata placed on facts over to the tuples the fact is converted to.

(let [with-map (with* [(f [:test/baz 1 2 3] :foo "bar")])]
 (-> (with-map [:test/baz 3]) first meta :foo) => "bar")

4.2.2 simulate*

This map is then given as a parameter to simulate*, along with the rule function to be simulated.

A simple rule is simulated by applying tuples from the set corresponding to the rule's source-fact, and then aggregating the results.

(simulate* foo-yx {[:test/foo 2] #{[1 2] [3 4]}}) => #{[2 1] [4 3]}

In rules with joins, the continuations are followed.

(simulate* timeline (with* [[:test/follows "alice" "bob"]
                           [:test/tweeted "bob" "hello"]]))
 => #{["alice" "hello"]}

5 simulate-rules-with: Perform a simulation Based on a Complete Namespace

simulate-with is useful for testing a single rule. However, sometimes we are interested in testing the integration of several rules together. simulate-rules-with is given a collection of rules (and possibly definitions that are not rules and are ignored), and a collection of facts.
It extends this set of facts with derived facts that are produced by applying the rules.

For example, consider the trending rule defined (here)[#derived-facts]. This rule aggregates the timelines of certain influencers into a single trending timeline.

(let [derived (simulate-rules-with [timeline trending]
                                  [[:test/influencer "alice"]
                                   [:test/follows "alice" "bob"]
                                   [:test/tweeted "bob" "hello"]])]
 (derived [:cloudlog.core_test/trending 1]) => #{["hello"]})

We topologically-sort the rules so the order in which they appear in the call to simulate-rules-with does not matter.

(let [derived (simulate-rules-with [trending timeline]
                                  [[:test/influencer "alice"]
                                   [:test/follows "alice" "bob"]
                                   [:test/tweeted "bob" "hello"]])]
 (derived [:cloudlog.core_test/trending 1]) => #{["hello"]})

The second argument (writer group identifier) has the same meaning as in simulate-with.

(let [derived (simulate-rules-with [trending timeline]
                                  [[:test/influencer "alice"]
                                   [:test/follows "alice" "bob"]
                                   [:test/tweeted "bob" "hello"]])]
 (-> (derived [:cloudlog.core_test/trending 1])
     first
     meta
     :writers) => #{"cloudlog.core_test"})

The rule collection can include elements that are not rules, which are ignored.

(let [derived (simulate-rules-with [1 "foo" timeline inc]
                                  [[:test/follows "alice" "bob"]
                                   [:test/tweeted "bob" "hello"]])]
 (derived [:cloudlog.core_test/timeline 2]) => #{["alice" "hello"]})

5.1 Under the Hood

The key to what simulate-rules-with is doing is sorting the given rule functions topologically using graph/toposort. This is done in the sort-rules function, which takes a collection of rules and sorts them by dependencies.

(sort-rules [trending timeline]) => [timeline trending]

Sorting should be arbitrary in the case where two rules (or clauses) have the same target fact, as in the case of clauses sharing a predicate, as follows:

(defclause foo-1
  [:test/foo a -> b]
  (let [b (inc a)]))

(defclause foo-2
  [:test/foo a -> b]
  (let [b (* a 2)]))

(set (sort-rules [foo-1 foo-2])) => #{foo-1 foo-2}

Many of the operations here are based on iterating on the continuations of a rule function. These are returned by the rule-cont function.

Typically, we are not going to use these functions directly, but rather map them to their meta properties:

(let [conts (rule-cont timeline)]
 (map #(-> % meta :source-fact) conts) => [[:test/follows 2] [:test/tweeted 2]])

The function rule-target-fact returns the target fact of a rule.

(rule-target-fact timeline) => [:cloudlog.core_test/timeline 2]

6 run-query: Applies a Clause on the Facts

Eventually, users query the state of the application through clauses. run-query takes a collection of rule functions, a query, its output arity, a reader set (the identity of whom is making the query), and a collection of facts. It returs a set of tuples returned from this query.

(let [rules (map (fn [[k v]] @v) (ns-publics 'cloudlog.core_test))]
 (run-query rules
            (f [:test/multi-keyword-search ["rant" "politics"]]) 1 #{}
            [(f [:test/doc 100 "This is a song about love."])
             (f [:test/doc 200 "This is a rant about politics."])
             (f [:test/doc 200 "This is a rant about love."])
             (f [:test/doc 300 "This is a doc about nothing."])])
 => #{["This is a rant about politics."]})

The query result will not include elements the user performing the query is not allowed to see.

(let [rules (map (fn [[k v]] @v) (ns-publics 'cloudlog.core_test))]
 (run-query rules
            (f [:test/multi-keyword-search ["this" "sentence"]]) 1 #{:me}
            [(f [:test/doc 100 "This is a sentence"] :readers interset/universe)
             (f [:test/doc 200 "This is another sentence"] :readers #{:someone-else})])
 => #{["This is a sentence"]})

run-query aggregates the results coming from different clauses of the same prediate. For example, the following holds with the above definitions of foo-1 and foo-2:

(run-query [foo-1 foo-2]
          (f [:test/foo 2]) 1 #{} [])
 => #{[3] [4]}

7 fact-table: Get a Fully-Qualified Name for a Fact

In an implementation of a Cloudlog engine it is necessary, given a :source-fact meta of a rule, to know which real-world resources (tables, queues etc) are associated with this fact. Doing so consistently is important to get different references to the same fact to work against the same resources.

The function fact-table takes a [name arity] pair that is given as the :source-fact of a rule (or a continuation) and returns a string representing this fact.

For raw facts (represented by Clojure keywords, e.g., :test/follows), the returned string is simply the fully qualified name of the keyword:

(-> timeline meta :source-fact fact-table) => "test/follows"

For a derived fact, represented as a name of a rule, the returned string is the fully qualified name of the rule.

(-> trending meta :continuation
   meta :source-fact fact-table) => "cloudlog.core_test/timeline"