
Michele Borassi: Google Summer of Code: let's start!

This blog will follow my Google Summer of Code project, entitled Performance Improvements for the Graph Module of Sagemath. The complete project is available here, and related documents with partial results will be available on the same website.
In this first post, I would like to thank my mentor David Coudert and Nathann Cohen, who helped me a lot in writing this project and understanding how the graph module of Sagemath works.
With their help, and with the help of the Sage community, I hope this will be a useful and fun project! Let's start!

Michele Borassi: Comparison of Graph Libraries

Many times, people have asked me "Which is the best available graph library?", or "Which graph library should I use to compute this, or that?".
Well, personally I love to use Sage, but there are also several good alternatives. The question then becomes "How could we improve Sage, so that people will choose it?".

In my opinion, graph libraries are compared according to the following parameters:
  1. simplicity and documentation: people have little time, and the faster they learn how to use the library, the better;
  2. number of routines available;
  3. speed: sometimes, the input is very big, and the algorithms take a long time to finish, so a fast implementation is essential.
While it is very difficult to measure the first point, the others can be compared and improved. For this reason, in order to outperform other libraries, we should implement new features, and improve existing ones. You don't say!

However, this answer is not satisfactory: in principle, we could add all the features available in other libraries, but this would be a huge porting effort, and while we are doing it the other libraries will keep changing, making the effort a never-ending story.

My project proposes an alternative: cooperating instead of competing. I will try to interface Sage with other libraries, and to use their algorithms when the Sage counterpart is not available, or less efficient. This way, with an affordable amount of work, we will be able to run all algorithms available in the best graph libraries!

As a first step, I have compared all the most famous C, C++, and Python graph libraries according to points 2 and 3, in order to choose which libraries should be included. The next posts will analyze the results of this comparison.

Michele Borassi: Performance Comparison of Different Graph Libraries

As promised in the last post, I have compared the performance of several graph libraries, in order to choose which ones should be deployed with Sagemath. Here, I provide the main results of this analysis, while more details are available on my website (see also the links below).
The libraries chosen are the most famous graph libraries written in Python, C, or C++ (I have chosen these languages because they are easier to integrate in Sagemath, using Cython). Furthermore, I have excluded NetworkX, which is already deployed with Sagemath.
First of all, I have to stress that no graph library comparison can be completely fair, and this comparison, too, can be criticized, due to the large number of available routines, the constant evolution of the libraries, and many small differences in the outputs (for instance, one library might compute the value of a maximum s-t flow, another library might actually compute the flow, and a third one might compute all maximum flows). Despite this, I have tried to be as fair as possible, through a deeper and more detailed analysis than previous comparisons (https://graph-tool.skewed.de/performance, http://www.programmershare.com/3210372/, http://arxiv.org/pdf/1403.3005.pdf).
The first comparison deals with the number of algorithms implemented. I have chosen a set of 107 possible algorithms, trying to cover all possible tasks that a graph library should perform (avoiding easy tasks that are common to all libraries, like outputting the number of nodes, the number of edges, the neighbors of a node, etc). In some cases, two tasks were collapsed into one, if the algorithms solving them are very similar (for instance, computing a maximum flow and computing a minimum cut, computing vertex betweenness and edge betweenness, etc).
The number of routines available for each library is plotted in the following chart, and a table containing all features is available in HTML or as a Google Sheet.

The results show that Sagemath has more routines than all competitors (66), closely followed by igraph (62). All other libraries are very close to each other, having about 30 routines each. Furthermore, Sagemath could be improved in the fields of neighbor similarity measures (assortativity, bibcoupling, cocitation, etc), community detection, and random graph generators. For instance, igraph contains 29 routines that are not available in Sagemath.

The second comparison analyzes the running-time of some of the algorithms implemented in the libraries. In particular, I have chosen 8 of the most common tasks in graph analysis: computing the diameter, computing the maximum flow between two vertices, finding connected components and strongly connected components, computing betweenness centrality, computing the clustering coefficient, computing the clique number, and generating a graph with the preferential attachment model. I have run each of these algorithms on 3 inputs, and I have considered the total execution time (excluding the time needed to load the graph). More details on this experiment are available here, and the results are also available in a Google Sheet.
In order to make the results more readable, I have plotted the ratio between the time needed by a given library and the minimum time needed by any library. If an algorithm was not implemented, or it needed more than 3 hours to complete, the corresponding bar is not shown.

Overall, the results show that NetworKit is the fastest library, or one of the fastest, in all the routines it implements (apart from the generation of preferential attachment graphs, where it is very slow). The Boost graph library is very close to NetworKit, and it also contains more routines. Sagemath is also quite efficient in all tasks, apart from the computation of strongly connected components and the generation of a preferential attachment graph, where it needed more than 3 hours. However, in the latter case, the main problem was not speed but memory consumption.

In conclusion, Sagemath can benefit greatly from the possibility of using algorithms from other libraries. It could increase the number of algorithms offered, especially by including igraph, and it could also improve its performance, by including Boost, NetworKit, or other fast graph libraries.

Vince Knight: Python, natural language processing and predicting funny


Every year there is a big festival in Edinburgh called the fringe festival. I blogged about this a while ago; in that post I did a very basic bit of natural language processing aiming to try and identify what made things funny. In this blog post I’m going to push that a bit further by building a classification model that aims to predict if a joke is funny or not. (tldr: I don’t really succeed, but that’s mainly because I have very little data - having more data would not necessarily guarantee success either, but the code and approach are what’s worth taking from this post… 😪).

If you want to skip the brief description and go straight to look at the code you can find the ipython notebook on github here and on cloud.sagemath here.

The data comes from a series of BBC articles which report (more or less every year since 2011?) the top ten jokes at the fringe festival. This does in fact only give 60-odd jokes to work with…

Here is the latest winner (by Tim Vine):

I decided to sell my Hoover… well it was just collecting dust.

After cleaning it up slightly I’ve thrown that all in a json file here. So in order to import the data into a pandas data frame I just run:

import pandas

df = pandas.read_json('jokes.json')  # Loading the json file

Pandas is great; I’ve been used to creating my own bespoke classes for handling data, but in general just using pandas does exactly the right job. At this point I basically follow along with this post on sentiment analysis of Twitter which makes use of the ridiculously powerful nltk library.

We can use the nltk library to ‘tokenise’ and get rid of common words:

commonwords = [e.upper() for e in set(nltk.corpus.stopwords.words('english'))]  # <- Need to download the corpus: import nltk; nltk.download()
commonwords.extend(['M', 'VE'])  # Adding a couple of things that need to be removed
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # To be able to strip out unwanted things in strings
string_to_list = lambda x: [el.upper() for el in tokenizer.tokenize(x) if el.upper() not in commonwords]
df['Joke'] = df['Raw_joke'].apply(string_to_list)

Note that this requires downloading one of the awesome corpuses (that’s apparently the right way to say that) from nltk.

Here is how this looks:

joke = 'I decided to sell my Hoover... well it was just collecting dust.'
string_to_list(joke)

which gives:

['DECIDED', 'SELL', 'HOOVER', 'WELL', 'COLLECTING', 'DUST']

We can now get started on building a classifier.

Here is the general idea of what will be happening:

First of all we need to build up the ‘features’ of each joke, in other words pull the words out into a nice easy format.

To do that we need to find all the words from our training data set, another way of describing this is that we need to build up our dictionary:

df['Year'] = df['Year'].apply(int)

def get_all_words(dataframe):
    """
    A function that gets all the words from the Joke column in a given dataframe
    """
    all_words = []
    for jk in dataframe['Joke']:
        all_words.extend(jk)
    return all_words

all_words = get_all_words(df[df['Year'] <= 2013])  # This uses all jokes before 2013 as our training data set.

We then build something that will tell us for each joke which of the overall words is in it:

def extract_features(joke, all_words):
    words = set(joke)
    features = {}
    for word in words:
        features['contains(%s)' % word] = (word in all_words)
    return features

Once we have done that, we just need to decide what we will call a funny joke. For this purpose we’ll use a funny_threshold and any joke that ranks above the funny_threshold in any given year will be considered funny:

funny_threshold = 5
df['Rank'] = df['Rank'].apply(int)
df['Funny'] = df['Rank'] <= funny_threshold
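
(A step not shown in the snippets here is the creation of the Features column used below; presumably it came from applying extract_features to each joke with the dictionary built above, something like this:)

df['Features'] = df['Joke'].apply(lambda jk: extract_features(jk, all_words))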

Now we just need to create a tuple for each joke that puts the features mentioned earlier and a classification (if the joke was funny or not) together:

df['Labeled_Feature'] = zip(df['Features'], df['Funny'])

We can now (in one line of code!!!!) create a classifier:

classifier = nltk.NaiveBayesClassifier.train(df[df['Year'] <= 2013]['Labeled_Feature'])

This classifier will take into account all the words in a given joke and spit out if it’s funny or not. It can also give us some indication as to what makes a joke funny or not:

classifier.show_most_informative_features(10)

Here is the output of that:

Most Informative Features
   contains(GOT) = True        False : True  = 2.4 : 1.0
  contains(KNOW) = True        True : False  = 1.7 : 1.0
contains(PEOPLE) = True        False : True  = 1.7 : 1.0
   contains(SEX) = True        False : True  = 1.7 : 1.0
 contains(NEVER) = True        False : True  = 1.7 : 1.0
    contains(RE) = True        True : False  = 1.6 : 1.0
contains(FRIEND) = True        True : False  = 1.6 : 1.0
   contains(SAY) = True        True : False  = 1.6 : 1.0
contains(BOUGHT) = True        True : False  = 1.6 : 1.0
   contains(ONE) = True        True : False  = 1.5 : 1.0

This immediately gives us some information:

  • If your joke is about SEX it is more likely to not be funny.
  • If your joke is about FRIENDs it is more likely to be funny.

That’s all very nice but we can now (theoretically - again, I really don’t have enough data for this) start using the mathematical model to tell you if something is funny:

joke = 'Why was 10 afraid of 7? Because 7 8 9'
classifier.classify(extract_features(string_to_list(joke), get_all_words(df[df['Year'] <= 2013])))

That joke is apparently funny (the output of the above is True). The following joke however is apparently not (the output below is False):

joke = 'Your mother is ...'
print classifier.classify(extract_features(string_to_list(joke), get_all_words(df[df['Year'] <= 2013])))

As you can see in the ipython notebook it is then very easy to measure how good the predictions are (I used the data from years before 2013 to predict 2014).

Results

Here is a plot of the accuracy of the classifier for changing values of funny_threshold:

You’ll notice a couple of things:

  • When the threshold is 0 or 1: the classifier works perfectly. This makes sense: all the jokes are either funny or not so it’s very easy for the classifier to do well.
  • There seems to be a couple of regions where the classifier does particularly poorly: just after a value of 4. Indeed there are points where the classifier does worse than flipping a coin.
  • At a value of 4, the classifier does particularly well!

Now, one final thing I’ll take a look at is what happens if I start randomly selecting a portion of the entire data set to be the training set:

Below are 10 plots that correspond to 50 repetitions of the above where I randomly sample a ratio of the data set to be the training set:

Finally (although it’s really not helpful), here are all of those on a single plot:

First of all: all those plots are basically one line of seaborn code which is ridiculously cool. Seaborn is basically magic:

sns.tsplot(data, steps)

Second of all, it looks like the lower bound of the classifiers is around .5. Most of them start off at .5; in other words they are as good as flipping a coin before we let them learn from anything, which makes sense. Finally it seems that the threshold-4 classifier is the only one that gradually improves as more data is given to it. That’s perhaps an indication that something interesting is happening there, but that investigation would be for another day.

All of the conclusions about the actual data should certainly not be taken seriously: I simply do not have enough data. But, the overall process and code is what is worth taking away. It’s pretty neat that the variety of awesome python libraries lets you do this sort of thing more or less out of the box.

Please do take a look at the github repository, but I’ve also just put the notebook on cloud.sagemath, so assuming you pip install the libraries and get the data etc. you can play around with this right in your browser:

Here is the notebook on cloud.sagemath.

Vince Knight: On testing degeneracy of bi-matrix games


We (James Campbell and Vince Knight are writing this together) have been working on implementing code in Sage to test if a game is degenerate or not. In this post we’ll prove a simple result that is used in the algorithm that we are/have implemented.

Bi-Matrix games

For a general overview of these sorts of things take a look at this post from a while ago on the subject of bi-matrix games in Sage. A bi-matrix is a matrix of tuples corresponding to payoffs for a 2 player Normal Form Game. Rows represent strategies for the first player and columns represent strategies for the second player, and each tuple of the bi-matrix corresponds to a tuple of payoffs. Here is an example:

We see that if the first player plays their third row strategy and the second player their second column strategy then the first player gets a utility of 6 and the second player a utility of 1.

This can also be written as two separate matrices. A matrix \(A\) for Player 1 and \(B\) for Player 2.

Here is how this can be constructed in Sage using the NormalFormGame class:

sage: A = matrix([[3, 3], [2, 5], [0, 6]])
sage: B = matrix([[3, 2], [2, 6], [3, 1]])
sage: g = NormalFormGame([A, B])
sage: g
Normal Form Game with the following utilities: {(0, 1): [3, 2], (0, 0): [3, 3], (2, 1): [6, 1], (2, 0): [0, 3], (1, 0): [2, 2], (1, 1): [5, 6]}

Currently, within Sage, we can obtain the Nash equilibria of games:

sage: g.obtain_nash()
[[(0, 1/3, 2/3), (1/3, 2/3)], [(4/5, 1/5, 0), (2/3, 1/3)], [(1, 0, 0), (1, 0)]]

We see that this game has 3 Nash equilibria. For each, we see that the supports (the sets of non-zero entries) of both players’ strategies are the same size. This is, in fact, a theoretical certainty when games are non degenerate.

If we modify the game slightly:

sage: A = matrix([[3, 3], [2, 5], [0, 6]])
sage: B = matrix([[3, 3], [2, 6], [3, 1]])
sage: g = NormalFormGame([A, B])
sage: g.obtain_nash()
[[(0, 1/3, 2/3), (1/3, 2/3)], [(1, 0, 0), (2/3, 1/3)], [(1, 0, 0), (1, 0)]]

We see that the second equilibrium has supports of different sizes. In fact, if the first player did play \((1,0,0)\) (in other words just play the first row) the second player could play any mixture of strategies as a best response, not just \((2/3,1/3)\). This is because the game in consideration is now degenerate.

(Note that both of the games above are taken from Nisan et al. 2007 [pdf].)

What is a degenerate game

A bimatrix game is called nondegenerate if the number of pure best responses to a mixed strategy never exceeds the size of its support. In a degenerate game, this definition is violated, for example if there is a pure strategy that has two pure best responses (as in the example above), but it is also possible to have a mixed strategy with support size \(k\) that has \(k+1\) pure best responses.

Here is an example of this:

If we consider the mixed strategy for player 2: \(y=(1/2,1/2)\), then the utility to player 1 is given by:

We see that there are 3 best responses to \(y\) and as \(y\) has support size 2 this implies that the game above is degenerate.

What does the literature say about degenerate games

The original definition of degenerate games was given in Lemke, Howson 1964 [pdf] and their definition was dependent on the labeling polytope that they used for their famous algorithm for the computation of equilibria (which is currently being implemented in Sage!). Further to this, von Stengel 1999 [ps] offers a nice overview of a variety of equivalent definitions.

Sadly, all of these definitions require finding a particular mixed strategy profile \((x, y)\) for which a particular condition holds. To be able to implement a test for degeneracy based on any of these definitions would require a continuous search over possible mixed strategy pairs.

In the previous example (where we take \(y=(1/2,1/2)\)) we could have identified this \(y\) by looking at the utilities for each pure strategy for player 1 against \(y=(y_1, 1-y_1)\):

(\(r_i\) denotes row strategy \(i\) for player 1.) A plot of this is shown:

We can (in this instance) quickly search through values of \(y_1\) and identify the point that has the most best responses, which gives the best chance of satisfying the degeneracy condition (\(y_1=1/2\)). This is not really practical from a generic point of view, which leads to the point of this blog post: we have identified the particular \(x, y\) that it is sufficient to test.

A sufficient mixed strategy to test for degeneracy

The definition of degeneracy can be written as:

Def. A Normal Form Game is degenerate iff:

There exists \(x\in \Delta X\) such that \( |S(x)| < |\sigma_2| \) where \(\sigma_2\) is the support such that \( (xB)_j = \max(xB) \), for all \(j \) in \( \sigma_2\).

OR

There exists \(y\in \Delta Y\) such that \( |S(y)| < |\sigma_1| \) where \(\sigma_1\) is the support such that \( (Ay)_i = \max(Ay) \), for all \(i \) in \( \sigma_1\).

(\(X\) and \(Y\) are the pure strategy sets for players 1 and 2, and \(\Delta X, \Delta Y\) the corresponding mixed strategy spaces.)

The result we are implementing in Sage aims to remove the need to search particular mixed strategies \(x, y\) (a continuous search) and replace that by a search over supports (a discrete search).

Theorem. A Normal Form Game is degenerate iff:

There exists \( \sigma_1 \subseteq X \) and \( \sigma_2 \subseteq Y \) such that \( |\sigma_1| < |\sigma_2| \) and \( S(x^*) = \sigma_1 \) where \( x^* \) is a solution of \( (xB)_j = \max(xB) \), for all \(j \) in \( \sigma_2 \) (note that a valid \(x^*\) is understood to be a mixed strategy vector).

OR

There exists \( \sigma_1 \subseteq X \) and \( \sigma_2 \subseteq Y \) such that \( |\sigma_1| > |\sigma_2| \) and \( S(y^*) = \sigma_2 \) where \( y^* \) is a solution of \( (Ay)_i = \max(Ay) \), for all \(i \) in \( \sigma_1 \).

Using the definition given above the proof is relatively straightforward but we will include it below (mainly to try and convince ourselves that we haven’t made a mistake).

We will only consider the first part of each condition (the ones for the first player). The result follows in the same way for the second player.

Proof \(\Rightarrow\)

Assume a game defined by \(A, B\) is degenerate, by the above definition without loss of generality this implies that there exists an \(x\in \Delta X\) such that \( |S(x)| < |\sigma_2| \) where \(\sigma_2\) is the support such that \( (xB)_j = \max(xB) \), for all \(j \) in \( \sigma_2\).

If we denote \(S(x)\) by \(\sigma_1\) then the definition implies that \(|\sigma_1| < |\sigma_2| \) and furthermore that \( (xB)_j = \max(xB) \), for all \(j \) in \( \sigma_2 \), as required.

Proof \(\Leftarrow\)

If we now assume that we have \(\sigma_1, \sigma_2, x^*\) as per the first part of the theorem then we have \(|\sigma_1|<|\sigma_2|\) and taking \(x=x^*\) implies that \(|S(x)|<|\sigma_2|\). Furthermore as \(x^*\) is a solution of \( (xB)_j = \max(xB) \) the result follows (by the definition given above).

Implementation

This result implies that we simply need to consider all potential pairs of supports. Depending on the relative size of the supports we can use one of the two conditions of the result. If we order the supports by size, the situation for a two player game looks somewhat like this:

Note that for an \(m\times n\) game there are \((2^m-1)\) potential supports for player 1 (the size of the powerset of the strategy set without the empty set) and \((2^n-1)\) potential supports for player 2. Thus the rectangle drawn above has dimension \((2^m-1)\times(2^n-1)\). Needless to say our implementation will not be efficient (testing degeneracy is, after all, an NP-complete problem in linear programming; see Chandrasekaran 1982 [pdf]), but at least we have identified exactly which mixed strategy we need to test for each support pair.
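
To make the discrete search concrete, here is a rough brute-force sketch of the first condition of the theorem (the one solving for player 1’s strategy) in plain Python with numpy. The function names are ours, floating point replaces Sage’s exact arithmetic, and this is an illustration rather than the Sage implementation:

import numpy as np
from itertools import chain, combinations

def supports(size):
    # All non-empty subsets of range(size): the powerset minus the empty set
    return chain.from_iterable(combinations(range(size), k) for k in range(1, size + 1))

def row_player_condition(B):
    # Search for sigma_1, sigma_2 with |sigma_1| < |sigma_2| and S(x*) = sigma_1
    m, n = B.shape
    for sigma_2 in supports(n):
        for sigma_1 in (s for s in supports(m) if len(s) < len(sigma_2)):
            cols = list(sigma_2)
            Bsub = B[np.ix_(sigma_1, cols)]
            # x (restricted to sigma_1) must equalise the payoffs of all columns
            # in sigma_2 and sum to 1: an overdetermined linear system
            eqs = np.vstack([Bsub[:, t] - Bsub[:, 0] for t in range(1, len(cols))]
                            + [np.ones(len(sigma_1))])
            rhs = np.zeros(len(cols))
            rhs[-1] = 1
            x = np.linalg.lstsq(eqs, rhs, rcond=None)[0]
            if np.allclose(eqs.dot(x), rhs) and np.all(x > 1e-9):  # S(x*) == sigma_1
                full = np.zeros(m)
                full[list(sigma_1)] = x
                payoffs = full.dot(B)
                if np.allclose(payoffs[cols], payoffs.max()):  # sigma_2 attains max(xB)
                    return True
    return False

For the degenerate game above, row_player_condition(np.array([[3, 3], [2, 6], [3, 1]])) returns True: with \(x^*=(1,0,0)\) both columns are best responses. The symmetric second condition can be checked by applying the same function to \(A^T\).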

References

  • Chandrasekaran, R., Santosh N. Kabadi, and Katta G. Murty. “Some NP-complete problems in linear programming.” Operations Research Letters 1.3 (1982): 101-104. [pdf]
  • Lemke, Carlton E., and Joseph T. Howson, Jr. “Equilibrium points of bimatrix games.” Journal of the Society for Industrial & Applied Mathematics 12.2 (1964): 413-423. [pdf]
  • Nisan, N., T. Roughgarden, E. Tardos, and V. V. Vazirani, eds. Algorithmic Game Theory. Vol. 1. Cambridge: Cambridge University Press, 2007. [pdf]
  • von Stengel, B. “Computing equilibria for two-person games.” Technical report. [ps]

Michele Borassi: Edge Connectivity through Boost Graph Library

After two weeks, we have managed to interface Boost and Sagemath!

However, the interface was not as simple as it seemed. The main problem we found is the genericity of Boost: almost all Boost algorithms work with several graph implementations, which differ in the data structures used to store edges and vertices. For instance, the code that implements breadth-first search works whether the adjacency list of a vertex v is a vector, a list, a set, etc. This is accomplished by using templates [1]. Unfortunately, the only way to interface Sagemath with C++ code is Cython, which is not yet template-friendly. In particular, Cython provides genericity through fused types [2], whose support is still experimental, and which do not offer full integration with templates [3-5].

After a thorough discussion with David, Nathann, and Martin (thank you very much!), we have found a solution: for the input, we have defined a fused type "BoostGenGraph", including all Boost graph implementations, and all functions that interface Boost and Sagemath use this fused type. This way, for each algorithm, we may choose the most suitable graph implementation. For the output, whose type might be dependent on the input type, we use C++ to transform it into a "standard" type (vector, or struct).

We like this solution because it is very clean, and it allows us to exploit Boost genericity without any copy-paste. Still, there are some drawbacks:
1) Cython fused types do not allow nested calls of generic functions;
2) Boost graphs cannot be converted to Python objects: they must be defined and deleted in the same Cython function;
3) No variable can have a generic type, apart from the arguments of generic functions.

These drawbacks will be overcome as soon as Cython makes templates and generic types interact: this way, we will be able to create a much stronger interface, by writing a graph backend based on Boost, so that the user might create, convert, and modify Boost graphs directly from Python. However, for the moment, we will implement all algorithms using the current interface, which already provides genericity, and which has no drawback if the only goal is to "steal" algorithms from Boost.

As a test, we have computed the edge connectivity of a graph through Boost: the code is available in ticket 18564 [6]. Since the algorithm provided by Sagemath is not optimal (it is based on linear programming), the difference in the running time is impressive, as shown by the following tests:

sage: G = graphs.RandomGNM(100,1000)
sage: %timeit G.edge_connectivity()
100 loops, best of 3: 1.42 ms per loop
sage: %timeit G.edge_connectivity(implementation="sage")
1 loops, best of 3: 11.3 s per loop


sage: G = graphs.RandomBarabasiAlbert(300,3)
sage: %timeit G.edge_connectivity(implementation="sage")
1 loops, best of 3: 9.96 s per loop
sage: %timeit G.edge_connectivity()
100 loops, best of 3: 3.33 ms per loop


Basically, on a random Erdos-Renyi graph with 100 vertices and 1000 edges, the new algorithm is 8,000 times faster, and on a random Barabasi-Albert graph with 300 nodes and average degree 3, the new algorithm is 3,000 times faster! This way, we can compute the edge connectivity of much bigger graphs, like a random Erdos-Renyi graph with 5,000 vertices and 50,000 edges:

sage: G = graphs.RandomGNM(5000, 50000)
sage: %timeit G.edge_connectivity()
1 loops, best of 3: 16.2 s per loop


The results obtained with this first algorithm are very promising: in the next days, we plan to interface several other algorithms, in order to improve both the number of available routines and the speed of Sagemath graph library!

[1] https://en.wikipedia.org/wiki/Template_%28C%2B%2B%29
[2] http://docs.cython.org/src/userguide/fusedtypes.html
[3] https://groups.google.com/forum/#!topic/cython-users/qQpMo3hGQqI
[4] https://groups.google.com/forum/#!searchin/cython-users/fused/cython-users/-7cHr6Iz00Y/Z8rS03P7-_4J
[5] https://groups.google.com/forum/#!searchin/cython-users/fused$20template/cython-users/-7cHr6Iz00Y/Z8rS03P7-_4J
[6] http://trac.sagemath.org/ticket/18564

Michele Borassi: New Boost Algorithms

Hello!
My Google Summer of Code project is continuing, and I am currently trying to include more Boost algorithms in Sage. In this post, I will make a list of the main algorithms I'm working on.

Clustering Coefficient


If two different people have a friend in common, there is a high chance that they will become friends: this is the property that the clustering coefficient tries to capture. For instance, if I pick two random people, very probably they will not know each other, but if I pick two of my acquaintances, very probably they will know each other. In this setting, the clustering coefficient of a person is the probability that two random acquaintances of this person know each other. In order to quantify this phenomenon, we can formalize everything in terms of graphs: people are nodes and two people are connected if they are acquaintances. Hence, we define the clustering coefficient of a vertex \(v\) in a graph \(G=(V,E)\) as:
$$\frac{2|\{(x,y) \in E:x,y \in N_v\}|}{\deg(v)(\deg(v)-1)}$$ where \(N_v\) is the set of neighbors of \(v\) and \(\deg(v)\) is the number of neighbors of \(v\). This is exactly the probability that two random neighbors of \(v\) are linked with an edge.
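As a sanity check, the definition translates directly into a few lines of Sage/Python (a naive quadratic transcription of the formula, purely for illustration — this is not the Boost implementation):

def clustering_coeff_naive(G, v):
    # Probability that two randomly chosen neighbors of v are adjacent
    neighbors = G.neighbors(v)
    deg = len(neighbors)
    if deg < 2:
        return 0
    links = sum(1 for i in range(deg) for j in range(i + 1, deg)
                if G.has_edge(neighbors[i], neighbors[j]))
    return 2 * links / (deg * (deg - 1))  # a rational number in Sage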
As part of my work, I have included in Sagemath the Boost algorithm to compute the clustering coefficient, which is more efficient than the previous algorithm, which was based on NetworkX:

sage: g = graphs.RandomGNM(20000,100000)
sage: %timeit g.clustering_coeff(implementation='boost')
10 loops, best of 3: 258 ms per loop
sage: %timeit g.clustering_coeff(implementation='networkx')
1 loops, best of 3: 3.99 s per loop

But Nathann did better: he implemented a clustering coefficient algorithm from scratch, using Cython, and he managed to outperform the Boost algorithm, at least when the graph is dense. Congratulations, Nathann! However, when the graph is sparse, the Boost algorithm still seems to be faster.

Dominator tree


Let us consider a road network, that is, a graph where vertices are street intersections, and edges are streets. The question is: if I close an intersection, where am I still able to go, assuming I am at home?
The answer to this question can be summarized in a dominator tree. Assume that, in order to go from my home to my workplace, I can choose many different paths, but all of these paths pass first through the café and then through the square (that is, if either the café or the square is closed, then there is no way I can go to work). In this case, in the dominator tree, the father of my workplace is the square, the father of the square is the café, and the father of the café is my home, which is also the root of the tree. More formally, given a graph \(G\), the dominator tree of \(G\) rooted at a vertex \(v\) is defined by connecting each vertex \(x\) with the last vertex \(y \neq x\) that belongs to each path from \(v\) to \(x\) (note that this vertex always exists, because \(v\) belongs to each path from \(v\) to \(x\)).
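To make the definition concrete, here is a naive (quadratic, illustration-only) way of listing the dominators of a vertex x with respect to a root v in Sage, checking directly which vertices every path from v to x must pass through; the efficient algorithm interfaced from Boost works very differently:

def dominators(G, root, x):
    # Vertices whose removal disconnects x from root (the root trivially dominates)
    doms = [root]
    for y in G.vertices():
        if y in (root, x):
            continue
        H = G.copy()
        H.delete_vertex(y)
        if not H.shortest_path(root, x):  # empty list: no path left
            doms.append(y)
    return doms

The father of x in the dominator tree is then the dominator "closest" to x; the Boost routine computes the whole tree far more efficiently than this vertex-by-vertex check.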
Until now, Sagemath did not have a routine to compute the dominator tree: I have been able to include the Boost algorithm. Unfortunately, due to several suggestions and improvements in the code, the ticket is not closed yet. Hopefully, it will be closed very soon!

Cuthill-McKee ordering / King ordering


Let us consider a graph \(G=(V,E)\): a matrix \(M\) of size \(|V|\) can be associated with this graph, where \(M_{i,j}=1\) if and only if there is an edge between vertices \(i\) and \(j\).
In some cases, this matrix can have specific properties that can be exploited for many purposes, like speeding up algorithms. One of these properties is the bandwidth, which measures how far the matrix is from a diagonal matrix: it is defined as \(\max_{M_{i,j} \neq 0}|i-j|\). A small bandwidth might help in computing several properties of the graph, like eigenvalues and eigenvectors.
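In Sage terms, the definition reads as follows (a small helper written just for this post, not a library function):

def bandwidth(M):
    # max |i - j| over the non-zero entries of the matrix M
    return max(abs(i - j) for i in range(M.nrows())
                          for j in range(M.ncols()) if M[i, j] != 0)

For instance, the identity matrix has bandwidth 0, and a tridiagonal matrix has bandwidth 1.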
Since the bandwidth depends on the order of vertices, we can try to permute them in order to obtain a smaller value: in Sage, we have a routine that performs this task. However, this routine is very slow, and it is prohibitive even for very small graphs (in any case, finding an optimal ordering is NP-hard).
Hence, researchers have developed heuristics to compute good orderings: the most important ones are Cuthill-McKee ordering and King ordering. Boost contains both routines, but Sage does not: for this reason, I would like to insert these two functions. The code is almost ready, but part of it depends on the code of the dominator tree: as soon as the dominator tree is reviewed, I will open a ticket on these two routines!

Dijkstra/Bellman-Ford/Johnson shortest paths


Let us consider again a road network. In this case, we are building GPS software, which has to compute the shortest path between the place where we are and the destination. The textbook algorithm that performs this task is Dijkstra's algorithm, which computes the distance between the starting point and any other reachable point (of course, there are more efficient algorithms involving preprocessing, but Dijkstra's is the simplest, and its running time is asymptotically optimal). This algorithm is already implemented in Sagemath.
Let's spice things up: what if that there are some streets with negative length? For instance, we like a street so much that we are willing to drive 100km more just to pass from that street, which is 50km long. It is like that street is -50km long!
First of all, under these assumptions, a shortest path might not exist: if there is a cycle with negative length, we may drive along that cycle all the times we want, decreasing more and more the distance to the destination. At least, we have to assume that no negative cycle exists.
Even with this assumption, Dijkstra algorithm does not work, and we have to perform Bellman-Ford algorithm, which is less efficient, but more general. Now, assume that we want something more: we are trying to compute the distance between all possible pairs of vertices. The first possibility is to run Bellman-Ford algorithm \(n\) times, where \(n\) is the number of nodes in the graph. But there is a better alternative: it is possible to perform Bellman-Ford algorithm only once, and then to modify the lengths of edges, so that all lengths are positive, and shortest paths are not changed. This way, we run Dijkstra algorithm \(n\) times on this modified graph, obtaining a better running time. This is Johnson algorithm.
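As a reference point, here is what the two ingredients look like in plain Python over a dict-of-dicts weighted digraph — a sketch of the textbook algorithms, not of the Boost code being interfaced:

def bellman_ford(graph, source):
    # graph: {u: {v: weight}}; returns distances from source or raises on a negative cycle
    dist = {v: float('inf') for v in graph}
    dist[source] = 0
    for _ in range(len(graph) - 1):
        for u in graph:
            for v, w in graph[u].items():
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
    for u in graph:  # one extra pass: any further improvement reveals a negative cycle
        for v, w in graph[u].items():
            if dist[u] + w < dist[v]:
                raise ValueError("negative cycle")
    return dist

def johnson_reweight(graph, h):
    # Johnson's trick: weights w + h[u] - h[v] are non-negative (when h comes from
    # Bellman-Ford run from a virtual source) and preserve shortest paths
    return {u: {v: w + h[u] - h[v] for v, w in nbrs.items()}
            for u, nbrs in graph.items()}

Running bellman_ford once from an auxiliary source connected to every vertex with weight 0 gives the potentials h; Dijkstra's algorithm can then safely be run \(n\) times on the reweighted graph.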
Both the Bellman-Ford and Johnson algorithms are implemented in Boost and not in Sagemath. As soon as I manage to create weighted Boost graphs (that is, graphs where edges have a length), I will also include these two algorithms!

Benjamin Hackl: Computing with Asymptotic Expressions


It has been quite some time since my last update on the progress of my Google Summer of Code project, which has two reasons. On the one hand, I have been busy because of the end of the semester, as well as because of the finalization of my Master’s thesis — and on the other hand, it is not very interesting to write a post discussing and implementing rather technical details. Nevertheless, Daniel Krenn and I have been quite busy in order to bring asymptotic expressions to SageMath. Fortunately, these efforts are starting to become quite fruitful.

In this post I want to discuss our current implementation roadmap (i.e. not only for the remaining Summer of Code, but also for the time afterwards), and give some examples for what we are currently able to do.

Structure and Roadmap

An overview of the entire roadmap can be found here (trac #17601). Recall that the overall goal of this project is to bring asymptotic expressions like $2^n + n^2 \log n + O(n)$ to Sage. Our implementation (which aims to be as general and expandable as possible) tackles this problem with a three-layer approach:

  • GrowthGroups and GrowthElements (trac #17600). These elements and parents manage the growth (and just the growth!) of a summand in an asymptotic expression like the one above. The simplest cases are monomial and logarithmic growth groups. For example, their elements are given by $n^r$ and $\log(n)^r$ where the exponent $r$ is from some ordered ring like $\mathbb{Z}$ or $\mathbb{Q}$. Both cases (monomial and logarithmic growth groups) can be handled in the current implementation — however, growth elements like $n^2 \log n$ are intended to live in the Cartesian product of a monomial and a logarithmic growth group (in the same variable). Parts of this infrastructure are already prepared (see trac #18587).
  • AsymptoticTerms and TermMonoids (trac #17715). While GrowthElements only represent the growth, AsymptoticTerms have more information: basically, they represent a summand in an asymptotic expression. There are different classes for each type of asymptotic term (e.g. ExactTerm and OTerm — with more to come). Additionally to a growth element, some types of asymptotic terms (like exact terms) also possess a coefficient.
  • AsymptoticExpression and AsymptoticRing (trac #17716). This is what we are currently working on, and we do have a running prototype! :-) The version that can be found on trac is only missing some doctests and a bit of documentation. Asymptotic expressions are the central objects within this project, and essentially they are sums of several asymptotic terms. In the background, we use a special data structure (“mutable posets“, trac #17693) in order to model the (partial) order induced by the various growth elements belonging to an asymptotic expression. This allows us to perform critical operations like absorption (when an \(O\)-term absorbs “weaker” terms) efficiently and in a simple way.

The resulting minimal prototype can, in some sense, be compared to Sage’s PowerSeriesRing: however, we also allow non-integer coefficients, and extending this prototype to work with multivariate expressions should not be too hard now, as the necessary infrastructure is there.

Following the finalization of the minimal prototype, there are several improvements to be made. Here are some examples:

  • Besides addition and multiplication, we also want to divide asymptotic expressions, and higher-order operations like exponentiation and taking the logarithm would be interesting as well.
  • Also, conversion from, for example, the symbolic ring is important when it comes to usability of our tools. We will implement and enhance this conversion gradually.

Examples

An asymptotic ring (over a monomial growth group with coefficients and exponents from the rational field) can be created with

sage: R.<x> = AsymptoticRing('monomial', QQ); R
Asymptotic Ring over Monomial Growth Group in x over Rational Field with coefficients from Rational Field

Note that we marked the code as experimental, meaning that you will see some warnings regarding the stability of the code. Now, as we have an asymptotic ring, we can do some calculations. For example, take $ (2\sqrt{x} + O(1))^{15}$:

sage: (2*x^(1/2) + O(x^0))^15
O(x^7) + 32768*x^(15/2)

We can also have a look at the underlying structure:

sage: expr = (x^(3/7) + 2*x^(1/5)) * (x + O(x^0)); expr
O(x^(3/7)) + 2*x^(6/5) + 1*x^(10/7)
sage: expr.poset
poset(O(x^(3/7)), 2*x^(6/5), 1*x^(10/7))
sage: print expr.poset.full_repr()
poset(O(x^(3/7)), 2*x^(6/5), 1*x^(10/7))
+-- null
|   +-- no predecessors
|   +-- successors:   O(x^(3/7))
+-- O(x^(3/7))
|   +-- predecessors:   null
|   +-- successors:   2*x^(6/5)
+-- 2*x^(6/5)
|   +-- predecessors:   O(x^(3/7))
|   +-- successors:   1*x^(10/7)
+-- 1*x^(10/7)
|   +-- predecessors:   2*x^(6/5)
|   +-- successors:   oo
+-- oo
|   +-- predecessors:   1*x^(10/7)
|   +-- no successors

As you might have noticed, the “O”-constructor that is used for the PowerSeriesRing and related structures can also be used here. In particular, $O(\mathit{expr})$ acts exactly as expected:

sage: expr
O(x^(3/7)) + 2*x^(6/5) + 1*x^(10/7)
sage: O(expr)
O(x^(10/7))

Of course, the usual rules for computing with asymptotic expressions hold:

sage: O(x) + O(x)
O(x)
sage: O(x) - O(x)
O(x)

So far, so good. Our next step is making the multivariate growth groups usable for the AsymptoticRing and then improving the overall user interface of the ring.

 


Vince Knight: Using the two thirds of the average game in class


This past week I have been delighted to have a short pedagogic paper accepted for publication in MSOR Connections. The paper is entitled: “Playing Games: A Case Study in Active Learning Applied to Game Theory”. The journal is open access and you can see a preprint here. As well as describing some literature on active learning I also present some data I’ve been collecting (with the help of others) as to how people play two subsequent plays of the two thirds of the average game (and talk about another game also).

In this post I’ll briefly put up the results here as well as mention a Python library I’m working on.

If you’re not familiar with it, the two thirds of the average game asks players to guess a number between 0 and 100. The closest number to 2/3rds of the average number guessed is declared the winner.
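
In code, judging a round of the game is tiny (a toy helper I’m writing here for illustration; it’s not from the twothirds library mentioned below):

def winner(guesses):
    # Index of the guess closest to 2/3 of the average guess
    target = 2.0 / 3 * sum(guesses) / len(guesses)
    return min(range(len(guesses)), key=lambda i: abs(guesses[i] - target))

For example, winner([50, 30, 20]) returns 2: the average is 100/3, two thirds of that is about 22.2, and 20 is the closest guess.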

I use this all the time in class and during outreach events. I start by asking participants to play without more explanation than the basic rules of the game. Following this, as a group we go over some simple best response dynamics that indicate that the equilibrium play for the game is for everyone to guess 0. After this explanation, everyone plays again.

Below you can see how this game has gone as a collection of all the data I’ve put together:

You will note that some participants actually increase their second guess, but in general we see a possible indication (based on two data points, so obviously this is not meant to be a conclusive statement) of convergence towards the theoretical equilibrium.

Here is a plot showing the relationship between the first and second guess (when removing the guesses that increase, although as you can see in the paper this does not make much difference):

The significant linear relationship between the guesses is given by:

So a good indication of what someone will guess in the second round is that it would be a third of their first round guess.

Here is some Sage code that produces the cobweb diagram assuming the following sequence represents each guess (using code by Marshall Hampton):
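
(The embedded Sage cell has not survived the formatting here; a minimal cobweb sketch for the fitted map \(f(x)=x/3\) — my own reconstruction, not Marshall Hampton’s original code — would look something like this:)

def cobweb(f, x0, n=10):
    # Iterate f from x0, drawing the classic staircase between y = f(x) and y = x
    pts = [(x0, 0)]
    x = x0
    for _ in range(n):
        pts += [(x, f(x)), (f(x), f(x))]
        x = f(x)
    return line(pts) + plot(f, 0, 100) + plot(lambda t: t, 0, 100, linestyle='--')

cobweb(lambda x: x / 3.0, 66)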

That plot shows the iterations of the hypothetical guesses if we were to play more rounds :)

The other thing I wanted to point at in this blog post is this twothirds library which will potentially allow anyone to analyse these games quickly. I’m still working on it but if it’s of interest please do jump in :) I have put up a Jupyter notebook demoing what it can do so far (which is almost everything but with some rough edges). If you want to try it out, download that notebook and run:

$ pip install twothirds

I hope that once the library is set up anyone who uses it could simply send over data of game plays via PR which would help update the above plots and conclusions :)

Vince Knight: A talk on computational game theory in Sagemath


Today, Cardiff University School of Mathematics students James Campbell and Hannah Lorrimore, Google Summer of Code student Tobenna P. Igwe (a PhD student at the University of Liverpool) and I presented the current game theoretic capabilities of Sagemath.

This talk happened as part of a two day visit to see Dima Pasechnik to work on the stuff we’ve been doing, and the visit was kindly supported by CoDiMa (an EPSRC funded project to support the development of GAP and Sagemath).

Here is the video of the talk:

Here is a link to the sage worksheet we used for the talk.

Here are some photos I took during the talk:

and here are some I took of us working on code afterwards:

Here is the abstract of the talk:

Game Theory is the study of rational interaction and is getting increasingly important in CS. The ability to quickly compute a solution concept for a nontrivial (non-)cooperative game helps a lot in practical and theoretic work, as well as in teaching. This talk will describe and demonstrate the game theoretic capabilities of Sagemath (http://www.sagemath.org/), a Python library, described as having the following mission: ‘Creating a viable free open source alternative to Magma, Maple, Mathematica and Matlab’.

The talk will describe algorithms and classes that are implemented for the computation of Nash equilibria in bimatrix games. These include:

  • A support enumeration algorithm;
  • A reverse search algorithm through the lrs library;
  • The Lemke-Howson algorithm using the Gambit library (https://github.com/gambitproject/gambit).

In addition to this, demonstrations of further capabilities that are actively being developed will also be given:

  • Tests for degeneracy in games;
  • A class for extensive form games which includes the use of the graph theoretic capabilities of Sage.

The following two developments which are being carried out as part of a Google Summer of Code project will also be demonstrated:

  • An implementation of the Lemke-Howson algorithm;
  • Extensions to N player games.

Demonstrations will use the (free) online tool cloud.sagemath which allows anyone with connectivity to use Sage (and solve game theoretic problems!). Cloud.sagemath also serves as a great teaching and research tool with access to not only Sage but Jupyter (IPython) notebooks, R, LaTeX and a variety of other software tools.

The talk will concentrate on strategic non-cooperative games but matching games and characteristic function games will also be briefly discussed.

Michele Borassi: Including igraph Library

Hello!
In this new blog post, I would like to discuss the inclusion of the igraph library inside Sage.
Up to now, I have interfaced Sagemath with the Boost graph library, in order to run Boost algorithms inside Sage. Now, I want to do the same with igraph, the other major C++ graph library, which stands out because it contains 62 routines, 29 of which are not available in Sage. Moreover, the igraph library is very efficient, as shown in [1] and in the previous post on library comparison.

The inclusion of igraph in Sage is quite complicated, because we have to include a new external library [2] (while in the Boost case we already had the sources). We started this procedure through ticket 18929: unfortunately, after this ticket is closed, igraph will only be an optional package, and we will have to wait one year before it becomes standard. The disadvantage of optional packages is that they must be installed before being able to use them; however, the installation is quite easy: it is enough to run Sage with the option -i python_igraph.

After the installation, the usage of the igraph library is very simple, because igraph already provides a Python interface that can be used in Sage. To transform the Sagemath network g_sage into an igraph network g_igraph, it is enough to type g_igraph = g_sage.igraph_graph(), while to create a Sagemath network from an igraph network it is enough to type g_sage = Graph(g_igraph) or g_sage = DiGraph(g_igraph). After this conversion, we can use all the routines offered by igraph!
For instance, if we want to create a graph through the preferential attachment model, we can do it with the Sagemath routine, or with the igraph routine:

sage: G = graphs.RandomBarabasiAlbert(100, 2)
sage: G.num_verts()
100
sage: G = Graph(igraph.Graph.Barabasi(100, int(2)))
sage: G.num_verts()
100


The result is the same (apart from randomness), but the time is very different:

sage: import igraph
sage: %timeit G = Graph(igraph.Graph.Barabasi(10000000, int(2)))
1 loops, best of 3: 46.2 s per loop

sage: G = graphs.RandomBarabasiAlbert(10000000, 2)
Stopped after 3 hours.

Otherwise, we may use igraph to generate graphs with the Forest Fire algorithm, which is not available in Sagemath:

sage: G = Graph(igraph.Graph.Forest_Fire(10, 0.1))
sage: G.edges()
[(0, 1, None), (0, 2, None), (1, 7, None), (2, 3, None), (2, 4, None), (3, 5, None), (3, 8, None), (4, 6, None), (8, 9, None)]



We may also do the converse: transform a Sage network into an igraph network and apply an igraph algorithm. For instance, we can use label propagation to find communities (a task which is not implemented in Sage):

sage: G = graphs.CompleteGraph(5)+graphs.CompleteGraph(5)
sage: G.add_edge(0,5)
sage: com = G.igraph_graph().community_label_propagation()
sage: len(com)
2
sage: com[0]
[0, 1, 2, 3, 4]
sage: com[1]
[5, 6, 7, 8, 9]


The algorithm found the two initial cliques as communities.

I hope that these examples are enough to show the excellent possibilities offered by igraph library, and that these features will soon be available in Sagemath!

[1] https://sites.google.com/a/imtlucca.it/borassi/unpublished-works/google-summer-of-code/library-comparison
[2] http://doc.sagemath.org/html/en/developer/packaging.html

Vince Knight: Simulating continuous Markov chains


In a blog post I wrote in 2013, I showed how to simulate a discrete Markov chain. In this post we’ll (written with a bit of help from Geraint Palmer) show how to do the same with a continuous chain which can be used to speedily obtain steady state distributions for models of queueing processes for example.

A continuous Markov chain is defined by a transition rate matrix which shows the rates at which transitions from one state to another occur. Here is an example of a continuous Markov chain:

This has transition rate matrix \(Q\) given by:

$$Q = \begin{pmatrix} -3 & 2 & 1 \\ 1 & -5 & 4 \\ 1 & 8 & -9 \end{pmatrix}$$

The diagonals have negative entries, which can be interpreted as a rate of no change. To obtain the steady state probabilities \(\pi\) for this chain we can solve the following matrix equation:

$$\pi Q = 0$$

If we include the fact that the sum of \(\pi\) must be 1 (so that it is indeed a probability vector) we can obtain the probabilities in Sagemath using the following:

You can run this here (just click on ‘Evaluate’):

Q = matrix(QQ, [[-3, 2, 1], [1, -5, 4], [1, 8, -9]])
transpose(Q).stack(vector([1, 1, 1])).solve_right(vector([0, 0, 0, 1]))

This returns:

(1/4, 1/2, 1/4)

Thus, if we were to randomly observe this chain:

  • 25% of the time it would be in state 1;
  • 50% of the time it would be in state 2;
  • 25% of the time it would be in state 3.

Now, the Markov chain in question means that if we’re in the first state, the rate at which a change to the second state happens is 2, and the rate at which a change to the third state happens is 1.

This is analogous to waiting at a bus stop at the first city. Buses to the second city arrive randomly at a rate of 2 per hour, and buses to the third city arrive randomly at a rate of 1 per hour. Everyone waiting for a bus catches the first one that arrives. So at steady state the population will be spread amongst the three cities according to \(\pi\).

Consider yourself at this bus stop. As all this is Markovian we do not care what time you arrived at the bus stop (memoryless property). You expect the bus to the second city to arrive 1/2 hour from now, with randomness, and the bus to the third city to arrive 1 hour from now, with randomness.

To simulate this we can sample two random numbers from the exponential distribution and find out which bus arrives first and ‘catch that bus’:

import random

[random.expovariate(2), random.expovariate(1)]

The above returned (for this particular instance):

[0.5003491524841699, 0.6107995795458322]

So here it’s going to take 0.5 hours for a bus to the second city to arrive, whereas it would take 0.61 hours for a bus to the third. So we would catch the bus to the second city after spending 0.5 hours at the first city.

We can use this to write a function that will take a transition rate matrix, simulate the transitions and keep track of the time spent in each state:

def sample_from_rate(rate):
    import random
    if rate == 0:
        return oo
    return random.expovariate(rate)

def simulate_cmc(Q, time, warm_up):
    Q = list(Q)  # In case a matrix is input
    state_space = range(len(Q))  # Index the state space
    time_spent = {s: 0 for s in state_space}  # Set up a dictionary to keep track of time
    clock = 0  # Keep track of the clock
    current_state = 0  # First state
    while clock < time:
        # Sample the transitions
        sojourn_times = [sample_from_rate(rate) for rate in Q[current_state][:current_state]]
        sojourn_times += [oo]  # An infinite sojourn to the same state
        sojourn_times += [sample_from_rate(rate) for rate in Q[current_state][current_state + 1:]]
        # Identify the next state
        next_state = min(state_space, key=lambda x: sojourn_times[x])
        sojourn = sojourn_times[next_state]
        clock += sojourn
        if clock > warm_up:  # Keep track if past warm up time
            time_spent[current_state] += sojourn
        current_state = next_state  # Transition
    pi = [time_spent[state] / sum(time_spent.values()) for state in state_space]  # Calculate probabilities
    return pi

Here are the probabilities from the same Markov chain as above:
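
(The exact call has been lost in the formatting here; it would have been along the lines of the following, with the time horizon and warm-up values being my guesses:)

simulate_cmc(Q, 1500, 100)  # time horizon and warm-up chosen for illustration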

which gave (on one particular run):

[0.25447326473556037, 0.49567517998307603, 0.24985155528136352]

This approach was used by Geraint Palmer, who is doing a PhD with Paul Harper and me. He used this to verify that calculations were being carried out correctly when he was trying to fit a model. James Campbell and I are going to try to use this to get an approximation for bigger chains that cannot be solved analytically in a reasonable amount of time. In essence, the simulation of the Markov chain makes sure we spend time calculating probabilities in states that are common.

Vince Knight: Why I am a paying member of cloud.sagemath


If you are not familiar with Sagemath, it is a free open source mathematics package that does simple things like expand algebraic expressions, as well as far more complex things (optimisation, graph theory, combinatorics, game theory etc…). Cloud.sagemath is a truly amazing tool, not just for Sage but for scientific computation in general, and it’s free. Completely 100% free. In this post I’ll explain why I pay for it.

A while ago, a colleague and I were having a chat about the fact that our site Maple license hadn’t been renewed fast enough (or something similar to that). My colleague was fairly annoyed by this saying something like:

‘We are kind of like professional athletes, if I played soccer at a professional club I would have the best facilities available to me. There would not be a question of me having the best boots.’

Now I don’t think we ever finished this conversation (or at least I don’t really remember what I said) but this is something that’s stayed with me for quite a while.

First of all:

I think there are probably a very large proportion of professional soccer players who do not play at the very top level and so do not enjoy having access to the very best facilities (I certainly wouldn’t consider myself the Ronaldo of mathematics…).

Secondly:

Mathematicians are (in some ways) way cooler than soccer players. We are somewhat like magicians; in the past we have not needed much more than a pencil and some paper to work our craft. Whilst a chemist/physicist/medical researcher needs a lab and/or other things, we can pretty much work just with a whiteboard.

We are basically magicians. We can make something from nothing.

Since moving to open source software for all my research and teaching this is certainly how I’ve felt. Before discovering open source tools I needed to make sure I had the correct licence before I could work, but this is no longer the case. I just need a very basic computer (I bought a thinkpad for £60 the other day!) and I am just as powerful as I could want to be.

This is even more true with cloud.sagemath. Anyone can use a variety of scientific computing tools for no cost whatsoever (not even a cost associated with the time spent installing software): it just works. I have used this to work on sage source code with students, carry out research and also to deliver presentations: it’s awesome.

So, why do I pay $7 a month to use it?

Firstly because it gives me the ability to move some projects to servers that are supposedly more robust. I have no doubt that they are more robust but in all honesty I can’t say I’ve seen problems with the ‘less’ robust servers (150 of my students used them last year and will be doing so again in the Autumn).

The main reason I pay to use cloud.sagemath is because I can afford to.

This was put in very clear terms to me during the organisation of DjangoCon Europe. The principle at Python conferences is that everyone pays to attend. This in turn ensures that funds are available for people who cannot afford to pay to attend.

I am in a lucky enough financial position that for about the price of two fancy cups of coffee a month I can help support an absolutely amazing project that helps everyone and anyone have the same powers a magician does. This helps (although my contribution is obviously a very small part of it) ensure that students and anyone else who cannot afford to help support the project, can use Sage.

Michele Borassi: Conclusion of the Main Part of the Project

Hi!
In this post, I will summarize the results obtained with the inclusion in Sage of Boost and igraph libraries. This was the main part of my Google Summer of Code project, and it was completed yesterday, when ticket 19003 was closed.

We have increased the number of graph algorithms available in Sage from 66 to 98 (according to the list used in the initial comparison of the graph libraries [1]). Furthermore, we have decreased the running time of several Sage algorithms: in some cases, we have been able to improve the asymptotic running time, obtaining up to 10000x improvements in our tests. Finally, during the inclusion of external algorithms, we have refactored and cleaned some of Sage's source code, like the shortest path routines: we have standardized the input and the output of 15 routines related to shortest paths, and we have removed duplicate code as much as possible.

More specifically, the first part of the project was the inclusion of the Boost graph library: since the library is only available in C++, we had to develop an interface. This interface lets us easily convert a Sage graph into a Boost graph and run algorithms on the converted graph. We have also written routines to translate the output back into a Sage-readable format: this way, the complicated Boost library is "hidden", and users can interact with it as they do with Sage. In particular, we have interfaced the following algorithms:
  • Edge connectivity (trac.sagemath.org/ticket/18564);
  • Clustering coefficient (trac.sagemath.org/ticket/18811);
  • Cuthill-McKee and King vertex orderings (trac.sagemath.org/ticket/18876);
  • Minimum spanning tree (trac.sagemath.org/ticket/18910);
  • Dijkstra, Bellman-Ford, Johnson shortest paths (trac.sagemath.org/ticket/18931).
All these algorithms were either not available in Sage or quite slow compared to the Boost routines. As far as we know, Boost does not offer other algorithms that improve on their Sage counterparts; however, if such algorithms are developed in the future, it will be very easy to include them using the new interface.
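
As a usage sketch, the Boost backend should be selectable from an ordinary Sage session roughly as follows (the keyword value is an assumption based on the tickets above; check the merged documentation if it differs):

sage: g = graphs.PetersenGraph()
sage: g.edge_connectivity(implementation='boost')  # Boost backend, per ticket 18564
3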

In the second part of the project, we included igraph: since this library already offers a Python interface, we decided to include it as an optional package (before it becomes a standard package, at least a year should pass [2]). To install the package, it is enough to type the following instructions from the Sage root folder:

sage -i igraph        # To install the igraph C core
sage -i python_igraph # To install the Python interface

Then, we can easily interact with igraph: for a list of available routines, it is enough to type "igraph." and press Tab twice. To convert a Sage graph g_sage into an igraph graph, it is enough to type g_igraph = g_sage.igraph_graph(), while a Sage graph can be instantiated from an igraph graph through g_sage = Graph(g_igraph) or g_sage = DiGraph(g_igraph). This way, all igraph algorithms are now available in Sage.
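
For example (a minimal sketch, assuming the optional packages above are installed):

sage: g_sage = graphs.PetersenGraph()
sage: g_igraph = g_sage.igraph_graph()   # Sage -> igraph
sage: h = Graph(g_igraph)                # igraph -> Sage
sage: g_sage.is_isomorphic(h)
True
sage: g_igraph.diameter()                # any igraph routine is now usable
2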

Furthermore, we have included the igraph maximum flow algorithm inside the corresponding Sage function, obtaining significant improvements (for more information and benchmarks, we refer to ticket 19003 [3]).
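
With the optional packages installed, the igraph backend can also be selected explicitly (a sketch based on the ticket; the keyword is algorithm='igraph'):

sage: g = graphs.PetersenGraph()
sage: g.flow(0, 6, algorithm='igraph')  # unit capacities on a 3-regular graph
3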

In conclusion, I think the project reached its main goal, the original plan was followed very closely, and we have been able to overcome all problems.

Before closing this post, I would like to thank the many people that helped me with great advice and provided great solutions to all the problems I faced. First of all, my mentor David Coudert: he always answered very quickly to all my queries, and he gave me great suggestions to improve the quality of the code I wrote. Then, a very big help came from Nathann Cohen, who often cooperated with David in reviewing my code and proposing new solutions. Moreover, I have to thank Martin Cross, who gave me good suggestions with the Boost graph library, and Volker Braun, who closed all my tickets. Finally, I have to thank the whole Sage community for giving me this great opportunity!

[1] https://docs.google.com/spreadsheets/d/1Iu1hkQtRn9J-sgfZbQTu2RoXzyjoMEWP5-cm3nAwnWE/edit?usp=sharing
[2] http://doc.sagemath.org/html/en/developer/coding_in_other.html
[3] http://trac.sagemath.org/ticket/19003

Benjamin Hackl: Asymptotic Expressions: Current Developments


Since my last blog entry on the status of our implementation of Asymptotic Expressions in SageMath, quite a lot of improvements have happened. Essentially, all the pieces required in order to have a basic working implementation of multivariate asymptotics are there. The remaining tasks within my Google Summer of Code project are:

  • Polish the documentation of our minimal prototype, which consists of #17716 and the respective dependencies. Afterwards, we will set this to needs_review.
  • Open a ticket for the multivariate asymptotic ring and put together everything that we have written so far there.

In this blog post I want to give some more examples of what can be done with our implementation right now and what we would like to be able to handle in the future.

Status Quo

After I wrote my last blog entry, we introduced a central idea/interface to our project: short notations. By using the short notation factory for growth groups (introduced in #18930) it becomes very simple to construct the desired growth group. Essentially, monomial growth groups (cf. #17600), i.e. groups that contain elements of the form variable^power (for a fixed variable and powers from some base ring, e.g. the Integer Ring or even the Rational Field), are represented by variable^base, where the base ring is also specified via its shortened name. The short notation factory then enables us to do the following:
sage: from sage.groups.asymptotic_growth_group import GrowthGroup
sage: G = GrowthGroup('x^ZZ'); G
Growth Group x^ZZ
sage: G.an_element()
x
sage: G = GrowthGroup('x^QQ'); G
Growth Group x^QQ
sage: G.an_element()
x^(1/2)

Naturally, this interface carries over to the generation of asymptotic rings: instead of the (slightly dubious) "monomial" keyword advertised in my last blog entry, we can now actually construct the growth group by specifying the respective growth group via its short representation:
sage: R.<x> = AsymptoticRing('x^ZZ', QQ); R
Asymptotic Ring <x^ZZ> over Rational Field
sage: (x^2 + O(x))^50
x^100 + O(x^99)

Recently, we also implemented another type of growth group: exponential growth groups (see #19028). These groups represent elements of the form base^variable, where the base is from some multiplicative group. For example, we could do the following:
sage: G = GrowthGroup('QQ^x'); G
Growth Group QQ^x
sage: G.an_element()
(1/2)^x
sage: G(2^x) * G(3^x)
6^x
sage: G(5^x) * G((1/7)^x)
(5/7)^x

Note: unfortunately, we have not yet implemented a function that allows taking an element of some growth group (e.g. x from a monomial growth group) as the variable in an exponential growth group. Implementing some way to “change” between growth groups by taking the log or the exponential function is one of our next steps.

We also made this short notation a central interface for working with cartesian products. This is implemented in #18587. For example, this allows us to construct growth groups containing elements like $2^x \sqrt[5]{x^2} \log(x)^2$:

sage: G = GrowthGroup('QQ^x * x^QQ * log(x)^ZZ'); G
Growth Group QQ^x * x^QQ * log(x)^ZZ
sage: G.an_element()
(1/2)^x * x^(1/2) * log(x)
sage: G(2^x * x^(2/5) * log(x)^2)
2^x * x^(2/5) * log(x)^2

Simple parsing from the symbolic ring (and from strings) is implemented. As mentioned above, operations like 2^G(x) or log(G(x)) are among the next steps on our roadmap.

Further Steps

Of course, having an easy way to generate growth groups (and thus also asymptotic rings) is nice — however, it would be even better if the process of finding the correct parent would be even more automated. Unfortunately, this requires some non-trivial effort regarding the pushout construction — which will certainly not happen within the GSoC project.

As soon as we have an efficient way to “switch” between factors of a growth group (e.g. by taking the logarithm or the exponential function), this has to be carried over to the asymptotic ring. Operations like
sage: 2^(x^2 + O(x))
2^(x^2) * 2^(O(x))

should then be possible, where the output could also be 2^(x^2) * O(x^g), with $g$ determined by series_precision().

Division of asymptotic expressions can be realized with just about the same idea, for example:

\[ \frac{1}{x^2 + O(x)} = \frac{1}{x^2} \frac{1}{1 + O(1/x)} = x^{-2} + O(x^{-3}), \]

and so on. If an infinite series occurs, it will have to be cut using an $O$-term, most likely depending on series_precision() as well.
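
To spell out the cut in the example above: writing \(x^2 + O(x) = x^2 (1 + T)\) with the \(O\)-term \(T = O(x^{-1})\), the geometric series gives

\[ \frac{1}{1+T} = \sum_{k=0}^{K-1} (-T)^k + O(T^K), \]

so for \(K = 2\) one recovers \(x^{-2}\bigl(1 + O(x^{-1}) + O(x^{-2})\bigr) = x^{-2} + O(x^{-3})\); the truncation index \(K\) is the natural place for series_precision() to enter.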

Ultimately, we would also like to incorporate, for example, Stirling’s approximation of the factorial such that we could do something like

sage: n.factorial()
sqrt(2*pi) * e^(n*log(n)) * (1/e)^n * n^(1/2) + ...

which then can be used to obtain asymptotic expansions of binomial coefficients like $\binom{2n}{n}$:

sage: (2*n).factorial() / (n.factorial()^2)
1/sqrt(pi) * 4^n * n^(-1/2) + ...

As you can see, there is still a lot of work within our “Asymptotic Expressions” project — nevertheless, with the minimal working prototype and the ability to create cartesian products of growth groups, the foundation for all of this is already implemented! 😉


Benjamin Hackl: Google Summer of Code 2015: Conclusion


The “Google Summer of Code 2015” program ended yesterday, on the 21st of August at 19:00 UTC. This blog entry shall provide a short wrap-up of our GSoC project.

The aim of our project was to implement a basic framework that enables us to do computations with asymptotic expressions in SageMath — and I am very happy to say that we very much succeeded in doing so. An overview of all our developments can be found at meta ticket #17601.

Although we did not really follow the timeline suggested in my original proposal (mainly because the implementation of the Asymptotic Ring took way longer than originally anticipated), we managed to implement the majority of ideas from my proposal — with the most important part being that our current prototype is stable. In particular, this means that we do not expect to make major design changes at this point. Every detail of our design is well-discussed and can be explained.

Of course, our “Asymptotic Expressions” project is far from finished, and we will continue to extend the functionality of our framework. For example, although working with exponential and logarithmic terms is currently possible, it is not very convenient because the $\log$, $\exp$, and power functions are not fully implemented. Furthermore, it would be interesting to investigate the performance gain obtained by cythonizing the central parts of this framework (e.g. parts of the MutablePoset…) — and so on…

To conclude, I want to thank Daniel Krenn for his hard work and helpful advice as my mentor, as well as the SageMath community for giving me the opportunity to work on this project within the Google Summer of Code program! :-)

Vince Knight: Natural language processing of new jokes from 2015


This is a brief update to a previous post: “Python, natural language processing and predicting funny”. In that post I carried out some basic natural language processing with Python to predict whether or not a joke is funny. In this post I just update that with some more data from this year’s Edinburgh Fringe festival.

Take a look at the ipython notebook which shows graphics and outputs of all the jokes. Interestingly, this year’s winning joke is not deemed funny by the basic model :) but overall the model was right 60% of the time this year (which is pretty good compared to last year).

Here is a summary plot of the classifiers for different thresholds of ‘funny’:

The corresponding plot this year (with the new data):

Take a look at the notebook file and by all means grab the csv file to play with (but do let me know how you get on :)).

William Stein: React, Flux, RethinkDB and SageMathCloud -- Summer 2015 update

I've been using databases and doing web development for over 20 years, and I've never really loved any database before and definitely didn't love any web development frameworks either. That all changed for me this summer...

SageMathCloud

SageMathCloud is a web application in which you collaboratively use Python, LaTeX, Markdown, Sage worksheets (sophisticated mathematics), task lists, R, Jupyter Notebooks, manage courses, write C programs, make chatrooms, and more. It is hosted on Google Compute Engine, but is also entirely open source and there is a pre-made Virtual Machine that you can download. A project in SMC is a Linux account, with resources constrained using cgroups and quotas. Many SMC users can collaborate on the same project, and have equal privileges in that project. Interaction with all file types (including Jupyter notebooks, task lists and course management) is synchronized in realtime, like Google docs. There is also a global notifications feed that shows all editing activity on all files in all projects on which the user collaborates, which is a sort of highly technical version of Facebook's feed.

Rewrite motivation

I originally wrote the SageMathCloud frontend using progressive-refinement jQuery (no third-party framework beyond that) and the Cassandra database. These were reasonable choices when I started. There are much better approaches now, which are critical to dramatically improving the user experience with SMC, and also growing the developer base. So far SMC has had no nontrivial outside contributions, probably due to the difficulty of understanding the code. In fact, I think nobody besides me has ever even installed SMC, despite these install notes.

We (me, Jon Lee, Nicholas Ruhland) are currently completely rewriting the entire frontend of SMC using React.js, Flux, and RethinkDB. We started this rewrite in June 2015, with Jon being supported by Google Summer of Code (2015), Nich being supported in part by NSF grants from Randy Leveque and Rekha Thomas, and with me being unemployed.

Terrible funding situation

I'm living on credit cards -- I have no NSF grant support anymore, and SageMathCloud is still losing a lot of money every month, and I'm unhappy about this situation. It was either completely quit working on SMC and instead teach or consult a lot, or lose tens of thousands of dollars. I am doing the latter right now. I was very caught off guard, since this is my first summer ever to not have NSF support since I got my Ph.D. in 2000, and I didn't expect to have my grant proposals all denied (which happened in June). There is some modest Angel investment in SageMath, Inc., but I can't bring myself to burn through that money on salary, since it would run out quickly, and I don't want to have to shut down the site due to not being able to pay the hosting bill. I've failed to get any significant free hosting, due to already getting free hosting in the past, and SageMath, Inc. not being in any incubators. For example, we tried very hard to get hosting from Google, but they flatly refused for these two reasons (they gave $60K in hosting to UW/Sage project in 2012). I'm clearly having trouble transitioning from an academic to an industry funding model. But if there are enough paying customers by January 2016, things will turn around.

Jon, Nich, and I have been working on this rewrite for three months, and hope to finish it by the end of September, when Jon and Nich will become busy with classes again. However, it seems unlikely we'll be able to finish at the current rate. Fortunately, I don't start teaching fulltime again until January, and we put a lot of work into doing a release in mid-August that fully uses RethinkDB and partly uses React.js, so that we can finish the second stage of the rewrite iteratively, without any major technical surprises.

RethinkDB

Cassandra is an excellent database for many applications, but it is not the right database for SMC and I'm making no further use of Cassandra. SMC is a realtime application that does a lot more reading than writing to the database, and SMC greatly benefits from realtime push updates from the database. I've tried quite hard in the past to build an appropriate architecture for SMC on top of Cassandra, but it is the wrong tool for the job. RethinkDB scales up linearly (with sharding and replication), and has high availability and automatic failover as of version 2.1.2. See https://github.com/rethinkdb/rethinkdb/issues/4678 for my painful path to ensuring RethinkDB actually works for me (the RethinkDB developers are incredibly helpful!).

React.js

I learned about React.js first from some "random podcast", then got more interested in it when Chris Swenson gave a demo at a Sage Days workshop in San Diego in May 2015. React (+Flux) is a web development framework that actually has solid ideas behind it, backed by an implementation that has been optimized and tested by a highly nontrivial real world application: namely the Facebook website. Even if I were to have the idea of React, implementing it in a way that is actually usable would be difficult. The key idea of React.js is that -- surprisingly -- it is possible to write efficient client-side code that describes how to render the application purely as a function of its state.

React is different than jQuery. With jQuery, you write lots of code explaining how to transform the user interface of your application from one complicated state (that you might never have anticipated happening) to another complicated state. When using React.js you don't write code about how your application's visible state changes -- instead you write code to answer the question: "given this state, what should the application look like". For me, it's a game changer. This is like what one does when writing video games; the innovation is that some people at Facebook figured out how to practically program this way in a client side web browser application, then tuned their implementation based on huge amounts of real world data (Facebook has users). Oh, and they open sourced the result and ran several conferences explaining React.

React.js reminds me of when Andrew Wiles proved Fermat's Last Theorem in the mid 1990s. Wiles (and Ken Ribet) had genuine new ideas, which dramatically reshaped the landscape of number theory. The best number theorists quickly realized this and adapted to the new world, pushing the envelope of Wiles' work far beyond what I expected could happen. Other people pretended like Wiles didn't exist and continued studying Fibonacci numbers. I browsed the web development section of Barnes and Noble last night and there were dozens of books on jQuery and zero on React.js. I feel for anybody who tries to learn client-side web development by reading books at Barnes and Noble.

IPython/Jupyter and PhosphorJS

I recently met with Fernando Perez, who founded IPython/Jupyter. He seemed to tell me that currently 9 people are working fulltime on rewriting the Jupyter web notebook using the PhosphorJS framework. I tried to understand PhosphorJS based on the github page, but couldn't, except to deduce that it is mostly the work of one person from Bloomberg/Continuum Analytics. Fernando told me that they chose PhosphorJS since it is very fast, and that their main motivation is to (1) make Jupyter better use their huge high-resolution monitors at their new institute at Berkeley, and (2) make it easier for developers like me to integrate/extend Jupyter into their applications. I don't understand (2), because PhosphorJS is perhaps the least popular web framework I've ever heard of (is it a web framework -- I can't tell?). I pushed Fernando to explain why they made that design choice, but didn't really understand the answer, except that they had spent a lot of time investigating alternatives (like React first). I'm intimidated by their resources and concerned that I'm making the wrong choice; however, I just can't understand why they have made what seems to me to be the wrong choice. I hope to understand more at the joint Sage/Jupyter Days 70 that we are organizing together in Berkeley, CA in November. (Edit: see https://github.com/ipython/ipython/issues/8239 for a discussion of why IPython/Jupyter uses PhosphorJS.)

Tables and RethinkDB

Our rewrite of SMC is built on Tables, Flux and React. Tables are client-side technology I wrote inspired by Facebook's GraphQL/Relay technology (and Meteor, Firebase, etc.); they synchronize data between clients and the backend database in realtime. Tables are defined by a JSON schema file, which specifies the fields in the table, and explains what get and set queries are allowed. A table is a subset of a much larger table in the database, with the subset defined by conditions that are relative to the user making the query. For example, the projects table has one entry for each project that the user is a collaborator on.

Tables are automatically synchronized between the user and the database whenever the database changes, using RethinkDB changefeeds. RethinkDB's innovation is to build realtime updates -- triggered when the result of a query to the database changes -- directly into the database at the lowest level. Of course it is possible to build something that looks the same from the outside using either a message queue (say using RabbitMQ or ZeroMQ), or by watching the replication stream from the database and triggering actions based on that (like Meteor does using MongoDB). RethinkDB's approach seems better to me, putting the abstraction at the right level. That said, based on mailing list traffic, searches, etc., it seems that very, very few people get RethinkDB yet. Also, despite years of development, RethinkDB only became "production ready" a few months ago, and only got automatic failover a few weeks ago. That said, after ironing out some kinks, I'm now using it with heavy traffic in production and it works very well.
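
For readers who haven't seen a changefeed: here is roughly what one looks like from the RethinkDB Python driver (a sketch only -- the database and table names are made up, and SMC's actual hub is written in CoffeeScript/Node.js):

import rethinkdb as r

conn = r.connect(host='localhost', port=28015)
# The cursor blocks until the result set of the query changes, then
# yields a {'old_val': ..., 'new_val': ...} document: push, not poll.
for change in r.db('smc').table('projects').changes().run(conn):
    print(change['new_val'])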

Flux

Once data is automatically synchronized between the database and web browsers in realtime, we can build everything else on top of this. Facebook also introduced an architecture pattern that they call Flux, which works well with React. It's very different than MVC-style two-way binding frameworks, where objects are directly linked to UI elements, with an object changing causing the UI element to change and vice versa. In SMC each major part of the system has two objects associated to it: Actions and Stores. We think of them in terms of the classical CQRS pattern -- command query responsibility segregation. Actions are commands -- they are Javascript "functions" that get stuff done, but they do not return values; instead, they impact the state of the store. The store has functions that allow one to query for the state of the store, but they do not change the state of the store. The store functions must only be functions of the internal state of the store and nothing else. They might cache their results and format their output to be very convenient for rendering. But that's it.

Actions usually cause the corresponding store (or stores) to change. When a store changes, it emits a change event, which causes any React components that depend on the store to be updated, which in many cases means they are re-rendered. There are optimizations one can introduce to reduce the amount of re-rendering, which if one isn't careful leads to subtle bugs; pretty much the only subtle React UI bugs one hits are caused by such optimizations. When the UI re-renders, the user sees their view of the world change. The user then clicks buttons, types, etc., which triggers actions, which in turn update stores (and tables, hence propagating changes to the ultimate source of truth, which is the RethinkDB database). As stores update, the UI again updates, etc.
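
Schematically (in Python pseudocode rather than SMC's actual CoffeeScript), the Actions/Stores split looks like this:

class Store:
    def __init__(self):
        self._state = {}
        self._listeners = []
    def get(self, key):
        # query: a pure function of internal state, no side effects
        return self._state.get(key)
    def subscribe(self, listener):
        self._listeners.append(listener)
    def _update(self, **changes):
        # only Actions call this; it emits a change event
        self._state.update(changes)
        for listener in self._listeners:
            listener()  # e.g. re-render the React components

class Actions:
    def __init__(self, store):
        self._store = store
    def set_title(self, title):
        # command: does something, returns nothing
        self._store._update(title=title)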

Status

So far, we have completely (re-)written the project listing, file manager, help/status page, new file page, project log, file finder, project settings, course management system, account settings, billing, project upgrade system, and file use notifications using React, Flux, and Tables, and the result works well. Bugs are much easier to fix, and it is easy (possible?) to understand the state of the system, since it is defined by the state of the database and the corresponding client-side stores. We've completely rethought everything about the UI in doing the rewrite of the above components, and it has taken several months. Also, as mentioned above, I completely rewrote most of the backend to use RethinkDB instead of Cassandra. There were also the weeks of misery for me after we made the switch over. Even after weeks of thinking/testing/wondering "what could go wrong?", we found out all kinds of surprising little things within hours of pushing everything into production, which took more than a week of sleep-deprived days to sort out.

What's left? We have to rewrite the file editor tabs system, the project tabs system, and all the applications (except course management): editing text files using Codemirror, task lists (which are surprisingly complicated!), color xterm terminals, Jupyter notebooks (which will still use an iframe for the notebook itself), Sage worksheets (with complicated html output embedded in codemirror), compressed file de-archiver, the LaTeX editor, the wiki and markdown editors, and file chat. We hope to find a clean way to abstract away the various SMC applications as plugins, so that other people can easily write their own applications/plugins that will run inside of SMC. There will be a rich collection of example plugins to build on, namely the ones listed above, which are all driven by critical-to-us real world applications.

Discussion about this blog post on Hacker News.

Sébastien Labbé: Abelian complexity of the Oldenburger sequence


The Oldenburger infinite sequence [O39] \[ K = 1221121221221121122121121221121121221221\ldots \] also known under the name of Kolakoski, is equal to its exponent trajectory. The exponent trajectory \(\Delta\) can be obtained by counting the lengths of blocks of consecutive and equal letters: \[ K = 1^12^21^22^11^12^21^12^21^22^11^22^21^12^11^22^11^12^21^22^11^22^11^12^21^12^21^22^11^12^21^12^11^22^11^22^21^12^21^2\ldots \] The sequence of exponents above gives the exponent trajectory of the Oldenburger sequence: \[ \Delta = 12211212212211211221211212\ldots \] which is equal to the original sequence \(K\). You can define this sequence in Sage:

sage: K = words.KolakoskiWord()
sage: K
word: 1221121221221121122121121221121121221221...
sage: K.delta()          # delta returns the exponent trajectory
word: 1221121221221121122121121221121121221221...

There are a lot of open problems related to basic properties of that sequence. For example, we do not know whether the sequence is recurrent, that is, whether every finite subword or factor (finite block of consecutive letters) always reappears. Also, it is still open whether the density of 1 in that sequence is equal to \(1/2\).

In this blog post, I do some computations on its abelian complexity \(p_{ab}(n)\), defined as the number of distinct abelian vectors of subwords of length \(n\) in the sequence. The abelian vector \(\vec{w}\) of a word \(w\) counts the number of occurrences of each letter: \[ w = 122112122122 \quad \mapsto \quad 1^5 2^7 \text{, abelianized} \quad \mapsto \quad \vec{w} = (5, 7) \text{, the abelian vector of } w \]
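
To make the definition concrete, here is a naive helper (just a sketch; the real abelian_vectors/abelian_complexity methods come from trac #17058 below):

sage: def abelian_vector(w, alphabet=(1, 2)):
....:     letters = list(w)   # the letters of the (finite) word
....:     return tuple(letters.count(a) for a in alphabet)
sage: abelian_vector(Word([1,2,2,1,1,2,1,2,2,1,2,2]))
(5, 7)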

Here are the abelian vectors of subwords of length 10 and 20 in the prefix of length 100 of the Oldenburger sequence. The functions abelian_vectors and abelian_complexity are not in Sage as of now. Code is available at trac #17058 to be merged in Sage soon:

sage: prefix = words.KolakoskiWord()[:100]
sage: prefix.abelian_vectors(10)
{(4, 6), (5, 5), (6, 4)}
sage: prefix.abelian_vectors(20)
{(8, 12), (9, 11), (10, 10), (11, 9), (12, 8)}

Therefore, the prefix of length 100 has 3 vectors of subwords of length 10 and 5 vectors of subwords of length 20:

sage: prefix.abelian_complexity(10)
3
sage: prefix.abelian_complexity(20)
5

I import the OldenburgerSequence from my optional spkg because it is faster than the implementation in Sage:

sage: from slabbe import KolakoskiWord as OldenburgerSequence
sage: Olden = OldenburgerSequence()

I count the number of abelian vectors of subwords of length 100 in the prefix of length \(2^{20}\) of the Oldenburger sequence:

sage: prefix = Olden[:2^20]
sage: %time prefix.abelian_vectors(100)
CPU times: user 3.48 s, sys: 66.9 ms, total: 3.54 s
Wall time: 3.56 s
{(47, 53), (48, 52), (49, 51), (50, 50), (51, 49), (52, 48), (53, 47)}

Number of abelian vectors of subwords of length less than 100 in the prefix of length \(2^{20}\) of the Oldenburger sequence:

sage: %time L100 = map(prefix.abelian_complexity, range(100))
CPU times: user 3min 20s, sys: 1.08 s, total: 3min 21s
Wall time: 3min 23s
sage: from collections import Counter
sage: Counter(L100)
Counter({5: 26, 6: 26, 4: 17, 7: 15, 3: 8, 8: 4, 2: 3, 1: 1})

Let's draw that:

sage: labels = ('Length of factors', 'Number of abelian vectors')
sage: title = 'Abelian Complexity of the prefix of length $2^{20}$ of Oldenburger sequence'
sage: list_plot(L100, color='green', plotjoined=True, axes_labels=labels, title=title)
[Plot (oldenburger_abelian_100.png): abelian complexity of the prefix of length \(2^{20}\), factor lengths up to 100]

It seems to grow roughly like \(\log(n)\). Let's now consider subwords of length \(2^n\) for \(0\leq n\leq 19\) in the same prefix of length \(2^{20}\):

sage: %time L20 = [(2^n, prefix.abelian_complexity(2^n)) for n in range(20)]
CPU times: user 41 s, sys: 239 ms, total: 41.2 s
Wall time: 41.5 s
sage: L20
[(1, 2), (2, 3), (4, 3), (8, 3), (16, 3), (32, 5), (64, 5), (128, 9),
(256, 9), (512, 13), (1024, 17), (2048, 22), (4096, 27), (8192, 40),
(16384, 46), (32768, 67), (65536, 81), (131072, 85), (262144, 90), (524288, 104)]

I now look at subwords of length \(2^n\) for \(0\leq n\leq 23\) in the longer prefix of length \(2^{24}\):

sage: prefix = Olden[:2^24]
sage: %time L24 = [(2^n, prefix.abelian_complexity(2^n)) for n in range(24)]
CPU times: user 20min 47s, sys: 13.5 s, total: 21min
Wall time: 20min 13s
sage: L24
[(1, 2), (2, 3), (4, 3), (8, 3), (16, 3), (32, 5), (64, 5), (128, 9), (256,
9), (512, 13), (1024, 17), (2048, 23), (4096, 33), (8192, 46), (16384, 58),
(32768, 74), (65536, 98), (131072, 134), (262144, 165), (524288, 229),
(1048576, 302), (2097152, 371), (4194304, 304), (8388608, 329)]

The next graph gathers all of the above computations:

sage: G = Graphics()
sage: legend = 'in the prefix of length 2^{}'
sage: G += list_plot(L24, plotjoined=True, thickness=4, color='blue', legend_label=legend.format(24))
sage: G += list_plot(L20, plotjoined=True, thickness=4, color='red', legend_label=legend.format(20))
sage: G += list_plot(L100, plotjoined=True, thickness=4, color='green', legend_label=legend.format(20))
sage: labels = ('Length of factors', 'Number of abelian vectors')
sage: title = 'Abelian complexity of Oldenburger sequence'
sage: G.show(scale=('semilogx', 2), axes_labels=labels, title=title)
[Plot (oldenburger_abelian_2e24.png): abelian complexity of the Oldenburger sequence, all three computations on a semi-log scale]

Linear growth in the above graphics, with the \(x\)-axis on a logarithmic scale, would mean growth like \(\log(n)\). After these experiments, my hypothesis is that the abelian complexity of the Oldenburger sequence grows like \(\log(n)^2\).

References

[O39] Oldenburger, Rufus (1939). "Exponent trajectories in symbolic dynamics". Transactions of the American Mathematical Society 46: 453-466. doi:10.2307/1989933

Sébastien Labbé: Arnoux-Rauzy-Poincaré sequences


In a recent article with Valérie Berthé [BL15], we provided a multidimensional continued fraction algorithm called Arnoux-Rauzy-Poincaré (ARP) to construct, given any vector \(v\in\mathbb{R}_+^3\), an infinite word \(w\in\{1,2,3\}^\mathbb{N}\) over a three-letter alphabet such that the frequencies of letters in \(w\) exist and are equal to \(v\), and such that the number of factors (i.e. finite blocks of consecutive letters) of length \(n\) appearing in \(w\) is linear and less than \(\frac{5}{2}n+1\). We also conjecture that for almost all \(v\) the constructed word describes a discrete path in the positive octant staying at a bounded distance from the Euclidean line of direction \(v\).

In Sage, you can construct this word using the next version of my package slabbe-0.2 (not released yet, email me to press me to finish it). The one with frequencies of letters proportional to \((1, e, \pi)\) is:

sage: from itertools import repeat
sage: from slabbe.mcf import algo
sage: D = algo.arp.substitutions()
sage: it = algo.arp.coding_iterator((1,e,pi))
sage: w = words.s_adic(it, repeat(1), D)
sage: w
word: 1232323123233231232332312323123232312323...

The factor complexity is close to \(2n+1\) and the balance is often less than or equal to three:

sage: w[:10000].number_of_factors(100)
202
sage: w[:100000].number_of_factors(1000)
2002
sage: w[:1000].balance()
3
sage: w[:2000].balance()
3

Note that staying almost surely at a bounded distance from the Euclidean line was proven in [DHS2013] for the Brun algorithm, another MCF algorithm.

Other approaches: Standard model and billiard sequences

Other approaches have been proposed to construct such discrete lines.

One of them is the standard model of Eric Andres [A03]. It is also equivalent to billiard sequences in the cube. It is well known that the factor complexity of billiard sequences is quadratic \(p(n)=n^2+n+1\) [AMST94]. Experimentally, we can verify this. We first create a billiard word of some given direction:

sage: from slabbe import BilliardCube
sage: v = vector(RR, (1, e, pi))
sage: b = BilliardCube(v)
sage: b
Cubic billiard of direction (1.00000000000000, 2.71828182845905, 3.14159265358979)
sage: w = b.to_word()
sage: w
word: 3231232323123233213232321323231233232132...

We create some prefixes of \(w\) that we represent internally as char*. The creation is slow because the implementation of billiard words in my optional package is in Python and is not that efficient:

sage: p3 = Word(w[:10^3], alphabet=[1,2,3], datatype='char')
sage: p4 = Word(w[:10^4], alphabet=[1,2,3], datatype='char') # takes 3s
sage: p5 = Word(w[:10^5], alphabet=[1,2,3], datatype='char') # takes 32s
sage: p6 = Word(w[:10^6], alphabet=[1,2,3], datatype='char') # takes 5min 20s

We see below that exactly \(n^2+n+1\) factors of length \(n<30\) appear in the prefix of length \(10^6 = 1000000\) of \(w\):

sage: A = ['n'] + range(30)
sage: c3 = ['p_(w[:10^3])(n)'] + map(p3.number_of_factors, range(30))
sage: c4 = ['p_(w[:10^4])(n)'] + map(p4.number_of_factors, range(30))
sage: c5 = ['p_(w[:10^5])(n)'] + map(p5.number_of_factors, range(30)) # takes 4s
sage: c6 = ['p_(w[:10^6])(n)'] + map(p6.number_of_factors, range(30)) # takes 49s
sage: ref = ['n^2+n+1'] + [n^2+n+1 for n in range(30)]
sage: T = table(columns=[A,c3,c4,c5,c6,ref])
sage: T
  n    p_(w[:10^3])(n)   p_(w[:10^4])(n)   p_(w[:10^5])(n)   p_(w[:10^6])(n)   n^2+n+1
+----+-----------------+-----------------+-----------------+-----------------+---------+
  0    1                 1                 1                 1                 1
  1    3                 3                 3                 3                 3
  2    7                 7                 7                 7                 7
  3    13                13                13                13                13
  4    21                21                21                21                21
  5    31                31                31                31                31
  6    43                43                43                43                43
  7    52                55                56                57                57
  8    63                69                71                73                73
  9    74                85                88                91                91
  10   87                103               107               111               111
  11   100               123               128               133               133
  12   115               145               151               157               157
  13   130               169               176               183               183
  14   144               195               203               211               211
  15   160               223               232               241               241
  16   176               253               263               273               273
  17   192               285               296               307               307
  18   208               319               331               343               343
  19   224               355               368               381               381
  20   239               392               407               421               421
  21   254               430               448               463               463
  22   268               470               491               507               507
  23   282               510               536               553               553
  24   296               552               583               601               601
  25   310               596               632               651               651
  26   324               642               683               703               703
  27   335               687               734               757               757
  28   345               734               787               813               813
  29   355               783               842               871               871

Billiard sequences generate paths that are at a bounded distance from a Euclidean line. This is equivalent to saying that the balance is finite. The balance is defined as the supremum, over all letters and all pairs of factors of the same length, of the difference between the numbers of occurrences of the letter in the two factors. For billiard sequences, the balance is 2:

sage: p3.balance()
2
sage: p4.balance() # takes 2min 37s
2
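
As an aside, here is a naive way to compute the balance restricted to a few factor lengths, on a plain list of letters (a quadratic sketch, only usable for short prefixes; Sage's Word.balance() is the real method):

sage: def naive_balance(letters, alphabet, lengths):
....:     best = 0
....:     for n in lengths:
....:         # all distinct factors of length n
....:         factors = set(tuple(letters[i:i+n]) for i in range(len(letters)-n+1))
....:         for a in alphabet:
....:             counts = [f.count(a) for f in factors]
....:             best = max(best, max(counts) - min(counts))
....:     return best
sage: naive_balance(list(p3), [1,2,3], range(1, 30)) <= 2  # consistent with p3.balance()
True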

Other approaches: Melançon and Reutenauer

Melançon and Reutenauer [MR13] also suggested a method that generalizes Christoffel words in higher dimension. The construction is based on the application of two substitutions generalizing the construction of Sturmian sequences. Below we compute the factor complexity and the balance of some of their words over a three-letter alphabet.

On a three-letter alphabet, the two morphisms are:

sage: L = WordMorphism('1->1,2->13,3->2')
sage: R = WordMorphism('1->13,2->2,3->3')
sage: L
WordMorphism: 1->1, 2->13, 3->2
sage: R
WordMorphism: 1->13, 2->2, 3->3

Example 1: periodic case \(LRLRLRLRLR\dots\). In this example, the factor complexity seems to be around \(p(n)=2.76n\) and the balance is at least 28:

sage: from itertools import repeat, cycle
sage: W = words.s_adic(cycle((L,R)),repeat('1'))
sage: W
word: 1213122121313121312212212131221213131213...
sage: map(W[:10000].number_of_factors, [10,20,40,80])
[27, 54, 110, 221]
sage: [27/10., 54/20., 110/40., 221/80.]
[2.70000000000000, 2.70000000000000, 2.75000000000000, 2.76250000000000]
sage: W[:1000].balance()  # takes 1.6s
21
sage: W[:2000].balance()  # takes 6.4s
28

Example 2: \(RLR^2LR^4LR^8LR^{16}LR^{32}LR^{64}LR^{128}\dots\) taken from the conclusion of their article. In this example, the factor complexity seems to be \(p(n)=3n\) and the balance is at least as high (i.e., as bad) as \(122\):

sage: W = words.s_adic([R,L,R,R,L,R,R,R,R,L]+[R]*8+[L]+[R]*16+[L]+[R]*32+[L]+[R]*64+[L]+[R]*128,'1')
sage: W.length()
330312
sage: map(W.number_of_factors, [10, 20, 100, 200, 300, 1000])
[29, 57, 295, 595, 895, 2981]
sage: [29/10., 57/20., 295/100., 595/200., 895/300., 2981/1000.]
[2.90000000000000,
 2.85000000000000,
 2.95000000000000,
 2.97500000000000,
 2.98333333333333,
 2.98100000000000]
sage: W[:1000].balance()  # takes 1.6s
122
sage: W[:2000].balance()  # takes 6s
122

Example 3: some random ones. The complexity \(p(n)/n\) oscillates between 2 and 3 for factors of length \(n=1000\) in prefixes of length 100000:

sage: for _ in range(10):
....:     W = words.s_adic([choice((L,R)) for _ in range(50)],'1')
....:     print W[:100000].number_of_factors(1000)/1000.
2.02700000000000
2.23600000000000
2.74000000000000
2.21500000000000
2.78700000000000
2.52700000000000
2.85700000000000
2.33300000000000
2.65500000000000
2.51800000000000

For ten randomly generated words, the balance goes from 6 to 27, which is much more than what is obtained for billiard words or by our approach:

sage: for _ in range(10):
....:     W = words.s_adic([choice((L,R)) for _ in range(50)],'1')
....:     print W[:1000].balance(), W[:2000].balance()
12 15
8 24
14 14
5 11
17 17
14 14
6 6
19 27
9 16
12 12

References

[BL15] V. Berthé, S. Labbé, Factor Complexity of S-adic words generated by the Arnoux-Rauzy-Poincaré Algorithm, Advances in Applied Mathematics 63 (2015) 90-130. http://dx.doi.org/10.1016/j.aam.2014.11.001
[DHS2013] V. Delecroix, T. Hejda, W. Steiner, Balancedness of Arnoux-Rauzy and Brun Words, in Combinatorics on Words, 119-131, Springer, 2013. http://link.springer.com/chapter/10.1007/978-3-642-40579-2_14
[A03] E. Andres, Discrete linear objects in dimension n: the standard model, Graphical Models 65 (2003) 92-111.
[AMST94] P. Arnoux, C. Mauduit, I. Shiokawa, J. I. Tamura, Complexity of sequences defined by billiards in the cube, Bull. Soc. Math. France 122 (1994) 1-12.
[MR13] G. Melançon, C. Reutenauer, On a class of Lyndon words extending Christoffel words and related to a multidimensional continued fraction algorithm, J. Integer Seq. 16, No. 9, Article 13.9.7 (2013). https://cs.uwaterloo.ca/journals/JIS/VOL16/Reutenauer/reut3.html