In my previous post I gave a general overview of what I want to do during this year's Google Summer of Code. For the first six weeks I worked on extending the VB parser and lexer used by SharpDevelop. That work is now finished, but it was a very complex and challenging task.
The challenge begins ...
One of the most complicated features of VB .NET is its support for XML literals. See this post by Jim O'Neil from Microsoft for more information. When I first read the post and saw the examples, I was quite shocked. I was not sure how it was possible to differentiate XML literals from normal VB code. Even VB comments introduced by ' or REM are not valid inside XML literals. Because the single-quote character can be used for XML string literals, the lexer has to treat it differently in "XML mode".
Speaking of "XML mode": a VB source file can contain code in two languages, VB code and XML markup. I had a lot of discussions about this topic with my mentor, who helped me a lot. It soon became clear to us that the best solution was to answer the question "Are we now in XML mode or normal mode?" at the level of the lexer only. More precisely, the lexer should return different tokens in XML mode and in normal mode, which avoids ambiguities in the parser's EBNF. For example, '<' and '>' can be interpreted either as LowerThan/GreaterThan or as XmlOpenTag/XmlCloseTag.
Are we now in XML mode or normal mode?
To answer this question, two other questions need to be asked and answered: "When does XML mode start?" and "When does XML mode end?"
I started playing around with VB sample code I found on the net and tried out how vbc.exe and VS handle invalid XML literals or VB code mixed with XML literals. One of the most important pieces of information I found was a post by Lucian Wischik. In his post you can find a link to a "hyperlinked grammar" of VB10. It was very interesting to jump through the grammar and see where certain language elements can be used. I found out that wherever an expression can be used, an XML literal can be used too.
Dim xml = <Test />
CallMe(<Test a="value">This is some text content, but not much</Test>, 3 < a, "Test")
The first example is quite simple. An expression can start after an assignment operator, so XML literals are allowed there. The second example is more complex. The first argument of the call to the method "CallMe" is an XML literal, the second argument is a Boolean expression, and the third is a simple string literal.
As you can see, at "text content, " we are still in XML mode, so the comma is not an argument separator. But what if the XML literal is not valid? This question is important because, when you implement code completion, you almost never have completely valid code. One answer I found after experimenting with VS and vbc was: XML mode stays active until every element is closed.
Dim xml = <Test>asdfasdf
Dim i = 4
In this code the XML literal is not valid and causes a compile-time error. "asdfasdf Dim i = 4" is seen as the text content of the Test element, which is never closed.
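The "active until every element is closed" rule can be sketched as a simple depth counter. This is only a minimal illustration in Python, not the SharpDevelop implementation: the function name find_xml_literal_end is made up for this example, and the sketch assumes plain elements only (no XML comments, CDATA sections, XML declarations or embedded VB expressions).

```python
def find_xml_literal_end(source, start):
    """Return the index just past the XML literal that begins at source[start].

    Returns None when an element is never closed, i.e. XML mode would
    stay active until the end of the input.
    Hypothetical helper; handles plain elements only.
    """
    depth = 0
    i = start
    n = len(source)
    while i < n:
        if source[i] != "<":
            i += 1  # text content: commas, spaces and newlines are all just text
            continue
        # Find the end of this tag, skipping over quoted attribute values.
        j = i + 1
        quote = None
        while j < n and (quote is not None or source[j] != ">"):
            ch = source[j]
            if quote is None and ch in "\"'":
                quote = ch      # entering an attribute value
            elif ch == quote:
                quote = None    # leaving the attribute value
            j += 1
        if j >= n:
            return None         # the tag itself is never terminated
        tag = source[i:j + 1]
        if tag.startswith("</"):
            depth -= 1          # closing tag
        elif not tag.endswith("/>"):
            depth += 1          # opening tag; self-closing tags leave depth unchanged
        i = j + 1
        if depth == 0:
            return i            # every element is closed: XML mode ends here
    return None
```

Applied to the samples above, the literal in "Dim xml = <Test />" ends right after "/>", the literal in the CallMe call ends right after "</Test>" (so the comma inside the text content is skipped), and the unclosed <Test> sample yields None because the element is never closed.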
Here are some more rules:
- An XML document must start with an XML declaration. Without the declaration, the root element is interpreted as a single element. If you start with an XML comment, it is seen as a single comment, and everything after it is normal VB code again.
- After the end of the root element of an XML document, you can write any number of comments.
Coming back to the CallMe example, we can see that the second argument of our call is a Boolean expression. From the viewpoint of the lexer, it is just a sequence of characters: '3', ' ', '<', ' ', 'a' ...
The lexer cannot tell that this is a valid Boolean expression. But while experimenting we found the following rule: '<' only introduces an XML literal if it is at the start of an expression.
3 < a is: Literal "<" Identifier
So we can see that the expression starts with a literal rather than with "<", so it is definitely not an XML literal. Good ... but, wait! How does the lexer know that? The parser would know it, because it knows about the language structure, but the lexer operates at the character level. The solution is to make the lexer more intelligent: it needs to know about the structure of the language in detail. But that's enough for now; we'll discuss the lexer implementation in my next post.
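Until then, here is a deliberately simplified Python sketch of the "start of an expression" rule. It is not the actual lexer. It assumes that looking at the kind of the previous token is enough, and the token kind names "Assign", "OpenParenthesis", "Comma", "Literal" and "Identifier" are hypothetical; only LowerThan and XmlOpenTag are taken from the text above. A real lexer needs far more knowledge of the language structure than this.

```python
# Hypothetical token kinds after which a new expression can begin.
# None stands for "no previous token" (start of input).
EXPRESSION_START_CONTEXT = {None, "Assign", "OpenParenthesis", "Comma"}


def classify_less_than(previous_kind):
    """Decide which token a '<' character should produce.

    previous_kind is the kind of the token that precedes the '<'.
    Assumption: the previous token alone tells us whether an
    expression can start at this position.
    """
    if previous_kind in EXPRESSION_START_CONTEXT:
        return "XmlOpenTag"   # '<' is at the start of an expression
    return "LowerThan"        # '<' follows an operand: comparison operator
```

For "Dim xml = <Test />" the '<' follows an assignment, so classify_less_than("Assign") gives XmlOpenTag. For "3 < a" the '<' follows a literal, so classify_less_than("Literal") gives LowerThan.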