= Nex =

Nex is a lexer similar to Lex/Flex that:

- generates Go code instead of C code
- integrates with Go's yacc instead of YACC/Bison
- supports UTF-8
- supports nested _structural regular expressions_

See http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf[Structural
Regular Expressions] by Rob Pike. I wrote this code to get acquainted with Go
and also to explore some of the ideas in the paper. I had also always meant to
implement the algorithms I learned in a compilers course many years ago. Back
then, we never coded them; merely understanding the theory was enough to pass
the exam.

Go has a less general http://golang.org/pkg/scanner/[scanner package]; it is
especially suited to tokenizing Go code.
== Installation ==

 $ export GOPATH=/tmp/go
 $ go get github.com/blynn/nex
== Example ==

http://flex.sourceforge.net/manual/Simple-Examples.html[One simple example in
the Flex manual] is a scanner that counts characters and lines. The program is
similar in Nex:

------------------------------------------
/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
func main() {
var nLines, nChars int
NN_FUN(NewLexer(os.Stdin))
fmt.Printf("%d %d\n", nLines, nChars)
}
------------------------------------------
The syntax resembles Awk more than Flex: each regex must be delimited. An empty
regex terminates the rules section and signifies the presence of user code,
which is printed on standard output with `NN_FUN` replaced by the generated
scanner.

Name the above example `lc.nex`. Then compile and run it by typing:

 $ nex -r -s lc.nex

The program reads standard input and writes to standard output. For example:

 $ nex -r -s lc.nex < /usr/share/dict/words
 99171 938587

To generate Go code for a scanner without compiling and running it, type:

 $ nex -s < lc.nex  # Prints code on standard output.

or:

 $ nex -s lc.nex  # Writes code to lc.nn.go.
The `NN_FUN` macro is primitive, but I was unable to think of another way to
achieve an Awk-esque feel. Purists unable to tolerate text substitution will
need more code:

------------------------------------------
/\n/{ lval.l++; lval.c++ }
/./{ lval.c++ }
//
package main
import ("fmt";"os")
type yySymType struct { l, c int }
func main() {
v := new(yySymType)
NewLexer(os.Stdin).Lex(v)
fmt.Printf("%d %d\n", v.l, v.c)
}
------------------------------------------

and must run `nex` without the `-s` option:

 $ nex lc.nex

We could avoid defining a struct by using globals instead, but even then we
would still need a throwaway definition of `yySymType`.

The `yy` prefix can be changed with the `-p` option. When using yacc, it must
be given the same prefix:

 $ nex -p YY lc.nex && go tool yacc -p YY && go run lc.nn.go y.go
== Toy Pascal ==

The Flex manual also exhibits a
http://flex.sourceforge.net/manual/Simple-Examples.html[scanner for a toy
Pascal-like language], though last I checked, its comment regex was a little
buggy. Here is a modified Nex version, without string-to-number conversions:

------------------------------------------
/[0-9]+/ { println("An integer:", txt()) }
/[0-9]+\.[0-9]*/ { println("A float:", txt()) }
/if|then|begin|end|procedure|function/
{ println("A keyword:", txt()) }
/[a-z][a-z0-9]*/ { println("An identifier:", txt()) }
/\+|-|\*|\// { println("An operator:", txt()) }
/[ \t\n]+/ { /* eat up whitespace */ }
/./ { println("Unrecognized character:", txt()) }
/{[^\{\}\n]*}/ { /* eat up one-line comments */ }
//
package main
import "os"
func main() {
lex := NewLexer(os.Stdin)
txt := func() string { return lex.Text() }
NN_FUN(lex)
}
------------------------------------------
Enough simple examples! Let us see what nesting can do.

== Peter into silicon ==

In ``Structural Regular Expressions'', Pike imagines a newline-agnostic Awk
that operates on matched text, rather than on the whole line containing a
match, and writes code converting an input array of characters into
descriptions of rectangles. For example, given an input such as:

------------------------------------------
#######
#########
#### #####
#### #### #
#### #####
#### ###
######## #####
#### #########
#### # # ####
## # ### ##
### # ###
### ##
## #
# ####
# #
## # ##
------------------------------------------

we wish to produce something like:

------------------------------------------
rect 5 12 1 2
rect 4 13 2 3
rect 3 7 3 4
rect 9 14 3 4
...
rect 10 12 16 17
------------------------------------------
With Nex, we don't have to imagine: such programs are real. Below are practical
Nex programs that strongly resemble their theoretical counterparts.

The one-character-at-a-time variant:

------------------------------------------
/ /{ x++ }
/#/{ println("rect", x, x+1, y, y+1); x++ }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
x, y := 1, 1
NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

The one-run-at-a-time variant:

------------------------------------------
/ +/{ x+=len(txt()) }
/#+/{ println("rect", x, x+len(txt()), y, y+1); x+=len(txt()) }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
x, y := 1, 1
lex := NewLexer(os.Stdin)
txt := func() string { return lex.Text() }
NN_FUN(lex)
}
------------------------------------------
The programs are more verbose than Awk because Go is the backend.

== Rob but not robot ==

Pike demonstrates how nesting structural expressions leads to a few simple text
editor commands to print all lines containing "rob" but not "robot". Though Nex
fails to separate looping from matching, a corresponding program is bearable:

------------------------------------------
/[^\n]*\n/ < { isrobot = false; isrob = false }
/robot/ { isrobot = true }
/rob/ { isrob = true }
> { if isrob && !isrobot { fmt.Print(lex.Text()) } }
//
package main
import ("fmt";"os")
func main() {
var isrobot, isrob bool
lex := NewLexer(os.Stdin)
NN_FUN(lex)
}
------------------------------------------
The "<" and ">" delimit nested expressions, and work as follows.

On reading a line, we find it matches the first regex, so we execute the code
immediately following the opening "<".

Then it's as if we run Nex again, except we focus only on the patterns and
actions up to the closing ">", with the matched line as the entire input. Thus
we look for occurrences of "rob" and "robot" in just the matched line and set
flags accordingly.

After the line ends, we execute the code following the closing ">" and return
to our original state, scanning for more lines.
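The same filter in plain Go may help clarify what the nested rules compute;
`keep` is a name introduced here purely for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// keep reports whether a line contains "rob" but not "robot", mirroring the
// two flags set by the nested rules above.
func keep(line string) bool {
	return strings.Contains(line, "rob") && !strings.Contains(line, "robot")
}

func main() {
	for _, line := range []string{"rob's robot", "rob alone", "robotics"} {
		if keep(line) {
			fmt.Println(line) // only "rob alone" survives the filter
		}
	}
}
```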
== Word count ==

We can simultaneously count lines, words, and characters with Nex thanks to
nesting:

------------------------------------------
/[^\n]*\n/ < {}
/[^ \t\r\n]*/ < {}
/./ { nChars++ }
> { nWords++ }
/./ { nChars++ }
> { nLines++ }
//
package main
import ("fmt";"os")
func main() {
var nLines, nWords, nChars int
NN_FUN(NewLexer(os.Stdin))
fmt.Printf("%d %d %d\n", nLines, nWords, nChars)
}
------------------------------------------

The first regex matches entire lines; each line is passed to the first level
of nested regexes. Within this level, the first regex matches words in the
line; each word is passed to the second level of nested regexes. Within the
second level, a regex causes every character of the word to be counted.
Lastly, we also count whitespace characters, a task performed by the second
regex of the first level of nested regexes. We could remove this statement to
count only non-whitespace characters.
== UTF-8 ==

The following Nex program converts Eastern Arabic numerals to the digits used
in the Western world, and also converts Chinese number phrases (the analog of
something like "one hundred and fifty-three") into digits.

------------------------------------------
/[零一二三四五六七八九十百千]+/ { fmt.Print(zhToInt(txt())) }
/[٠-٩]/ {
// The above character class might show up right-to-left in a browser.
// The equivalent of 0 should be on the left, and the equivalent of 9 should
// be on the right.
//
// The Eastern Arabic numerals are ٠١٢٣٤٥٦٧٨٩.
fmt.Print([]rune(txt())[0] - rune('٠'))
}
/./ { fmt.Print(txt()) }
//
package main
import ("fmt";"os")
func zhToInt(s string) int {
n := 0
prev := 0
f := func(m int) {
if 0 == prev { prev = 1 }
n += m * prev
prev = 0
}
loop:
for _, c := range s {
for m, v := range []rune("一二三四五六七八九") {
if v == c {
prev = m+1
continue loop
}
}
switch c {
case '零':
case '十': f(10)
case '百': f(100)
case '千': f(1000)
}
}
n += prev
return n
}
func main() {
lex := NewLexer(os.Stdin)
txt := func() string { return lex.Text() }
NN_FUN(lex)
}
------------------------------------------
== Nex and Go's yacc ==

The parser generated by `go tool yacc` exports so little that it's easiest to
keep the lexer and the parser in the same package.

Here's a yacc file based on the
http://dinosaur.compilertools.net/bison/bison_5.html[reverse-Polish-notation
calculator example from the Bison manual]:

------------------------------------------
%{
package main
import "fmt"
%}
%union {
n int
}
%token NUM
%%
input: /* empty */
| input line
;
line: '\n'
| exp '\n' { fmt.Println($1.n); }
;
exp: NUM { $$.n = $1.n; }
| exp exp '+' { $$.n = $1.n + $2.n; }
| exp exp '-' { $$.n = $1.n - $2.n; }
| exp exp '*' { $$.n = $1.n * $2.n; }
| exp exp '/' { $$.n = $1.n / $2.n; }
/* Unary minus */
| exp 'n' { $$.n = -$1.n; }
;
%%
------------------------------------------

We must import `fmt` even if we don't use it, since code generated by yacc
needs it. Also, the `%union` declaration is mandatory; it generates
`yySymType`.
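For reference, the struct yacc derives from the `%union` above looks roughly
like the sketch below (the generated code also carries internal bookkeeping
fields such as `yys`; the exact output differs between yacc versions):

```go
package main

import "fmt"

// Roughly the shape of the yySymType generated from the %union above.
type yySymType struct {
	yys int // parser bookkeeping, added by yacc
	n   int // the field declared in %union
}

func main() {
	// A lexer stores a token's semantic value in lval before returning NUM.
	var lval yySymType
	lval.n = 42
	fmt.Println(lval.n)
}
```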
Call the above `rp.y`. Then a suitable lexer, say `rp.nex`, might be:

------------------------------------------
/[ \t]/ { /* Skip blanks and tabs. */ }
/[0-9]*/ { lval.n,_ = strconv.Atoi(yylex.Text()); return NUM }
/./ { return int(yylex.Text()[0]) }
//
package main
import ("os";"strconv")
func main() {
yyParse(NewLexer(os.Stdin))
}
------------------------------------------

Compile the two with:

 $ nex rp.nex && go tool yacc rp.y && go build y.go rp.nn.go

For brevity, we work in the `main` package. In a larger project we might want
to write a package that exports a function wrapped around `yyParse()`. This is
fine, provided the parser and the lexer are both in the same package.
Alternatively, we could use yacc's `-p` option to change the prefix from `yy`
to one that begins with an uppercase letter.
== Matching the beginning and end of input ==

We can simulate Awk's BEGIN and END blocks with a regex that matches the
entire input:

------------------------------------------
/.*/ < { println("BEGIN") }
/a/ { println("a") }
> { println("END") }
//
package main
import "os"
func main() {
NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

However, this causes Nex to read the entire input into memory. To solve this
problem, Nex supports the following syntax:

------------------------------------------
< { println("BEGIN") }
/a/ { println("a") }
> { println("END") }
package main
import "os"
func main() {
NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

In other words, if a bare '<' appears as the first pattern, then its action is
executed before reading the input. The last pattern must be a bare '>', and
its action is executed on end of input.

Additionally, no empty regex is needed to mark the beginning of the Go
program. (Fortunately, an empty regex is also a Go comment, so there's no harm
done if one is present.)
== Matching Nuances ==

Among rules in the same scope, the longest matching pattern takes precedence.
In the event of a tie, the first pattern wins.

Unanchored patterns never match the empty string. For example,

 /(foo)*/ {}

matches "foo" and "foofoo", but not "".

Anchored patterns can match the empty string at most once; after the match,
the null strings at the start or end are "used up" and will not match again.
Internally, this is implemented by omitting the very first check of whether
the current state is accepting when running the DFA corresponding to the
regex. An alternative would be to simply ignore matches of length 0, but I
chose to allow anchored empty matches in case there turn out to be
applications for them. I'm open to changing this behaviour.
== Contributing and Testing ==

Check out this repo (or a clone of it) into a directory with the following
structure:

 mkdir -p nex/src
 cd nex/src
 git clone https://github.com/blynn/nex.git

The Makefile will put the binary into e.g. `nex/bin`.
== Reference ==

------------------------------------------
// NewLexer creates a new Lexer that reads from in.
func NewLexer(in io.Reader) *Lexer

// NewLexerWithInit creates a new Lexer object, runs the given callback on it,
// then returns it.
func NewLexerWithInit(in io.Reader, initFun func(*Lexer)) *Lexer

// Lex runs the lexer. It always returns 0.
// When the -s option is given, this function is not generated;
// instead, the NN_FUN macro runs the lexer.
func (yylex *Lexer) Lex(lval *yySymType) int

// Text returns the matched text.
func (yylex *Lexer) Text() string

// Line returns the current line number.
// The first line is line 0.
func (yylex *Lexer) Line() int

// Column returns the current column number.
// The first column is column 0.
func (yylex *Lexer) Column() int
------------------------------------------