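[Hand-Writing a Simple Browser] The HTML Parser

Approach

Implementing an HTML parser takes two main steps: lexical analysis and syntax analysis.

Lexical analysis

Lexical analysis has to recognize each type of token. The types are:

Start tag, such as <div>
End tag, such as </div>
Comment tag, such as <!--button-->
doctype tag, such as <!doctype html>
text, such as aaa

These are the outermost tokens. Inside a start tag we also need to split out the attributes, such as id="aaa".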

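In other words, the tokenizer works through the following cases:

The first level checks whether the remaining input starts with <. If it does not, the input up to the next < is text. If it does, we determine which kind of tag it is. For a start tag, we keep extracting attributes from its content until we reach >, and then go back to the first-level check.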
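
The decision order described above can be sketched as a small classifier. This is only an illustration: the function name tokenKind is hypothetical, and it looks at prefixes only, whereas the implementation later in the article uses full regular expressions.

```javascript
// Hypothetical helper: decide which kind of token the input starts with.
function tokenKind(html) {
    if (!html.startsWith('<')) return 'text';      // first level: no leading '<' means text
    if (html.startsWith('<!--')) return 'comment'; // second level: which kind of tag?
    if (html.startsWith('<!doctype ')) return 'docType';
    if (html.startsWith('</')) return 'tagEnd';
    return 'tagStart';
}

console.log(tokenKind('aaa<div>'));        // 'text'
console.log(tokenKind('<!--button-->'));   // 'comment'
console.log(tokenKind('<!doctype html>')); // 'docType'
console.log(tokenKind('</div>'));          // 'tagEnd'
console.log(tokenKind('<div id="aaa">'));  // 'tagStart'
```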

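Syntax analysis

Syntax analysis assembles the tokens produced above into an AST.

Assembling an HTML AST is mostly a matter of parent-child relationships: we keep track of the current parent and attach text and children to it.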
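
To make the parent tracking concrete, here is a minimal sketch that assembles a tree from a token list that has already been produced. The token shapes mirror the ones used later in this article. The function name buildTree and the explicit stack are assumptions made for illustration.

```javascript
// Hypothetical helper: build a tree from a flat list of tokens.
function buildTree(tokens) {
    const root = { children: [] };
    const stack = [root]; // the last element is the current parent

    for (const token of tokens) {
        const parent = stack[stack.length - 1];
        if (token.type === 'tagStart') {
            const node = { tagName: token.value, text: '', children: [] };
            parent.children.push(node);
            stack.push(node); // the new tag becomes the current parent
        } else if (token.type === 'tagEnd') {
            if (stack.length > 1) stack.pop(); // back to the enclosing tag, never past the root
        } else if (token.type === 'text') {
            parent.text = token.value;
        }
    }
    return root.children[0];
}

const tree = buildTree([
    { type: 'tagStart', value: 'div' },
    { type: 'tagStart', value: 'p' },
    { type: 'text', value: 'hello' },
    { type: 'tagEnd', value: 'p' },
    { type: 'tagEnd', value: 'div' }
]);
console.log(JSON.stringify(tree, null, 4));
```

A stack is used here so that closing a deeply nested tag always returns to the right ancestor, and so the tree has no parent pointers that would make JSON.stringify fail on a cycle.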

Let's implement it in code:

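Implementation

Lexical analysis

First, we write the regular expressions for startTag, endTag, comment, docType and attribute:

Regular expressions

An end tag starts with </, followed by one or more of a-zA-Z0-9 and -, and then >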

const endTagReg = /^<\/([a-zA-Z0-9\-]+)>/;
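
As a quick check of what this pattern captures, the following sketch runs it against two sample strings. The sample inputs are made up for illustration.

```javascript
const endTagReg = /^<\/([a-zA-Z0-9\-]+)>/;

const match = '</div><p>'.match(endTagReg);
console.log(match[0]); // the full end tag: '</div>'
console.log(match[1]); // the captured tag name: 'div'

// The ^ anchor means the tag must sit at the very start of the input.
console.log('text</div>'.match(endTagReg)); // null
```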

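A comment tag starts with <!-- and ends at the first -->, with any characters in between. The middle part is matched lazily, so hyphens and other characters inside the comment are allowed.

const commentReg = /^<!--[\s\S]*?-->/;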
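
A negated character class works on single characters, not on a sequence such as -->, which is easy to get wrong when matching comments. The sketch below compares a class-based pattern with a lazy one. Both variable names are hypothetical and exist only for this comparison.

```javascript
// Hypothetical names, for comparison only.
const classBased = /^<!--[^(-->)]*-->/; // the class excludes single characters, including '-' and '>'
const lazy = /^<!--[\s\S]*?-->/;        // shortest run of any characters up to the first -->

const comment = '<!-- a-b --><p>';
console.log(classBased.test(comment)); // false: the hyphen inside the comment is rejected
console.log(lazy.exec(comment)[0]);    // '<!-- a-b -->'

// Both patterns agree on a comment without special characters.
console.log(classBased.test('<!--button-->')); // true
console.log(lazy.test('<!--button-->'));       // true
```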

A doctype tag is <!doctype followed by one or more non-> characters, and then >

const docTypeReg = /^<!doctype [^>]+>/;

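An attribute starts with one or more spaces, followed by one or more of a-zA-Z0-9 or -, then =, then one or more characters that are neither > nor a space. Stopping at a space keeps each match to a single attribute, consistent with the start tag pattern below. As a consequence, attribute values that contain spaces are not supported.

const attributeReg = /^(?:[ ]+([a-zA-Z0-9\-]+=[^> ]+))/;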

A start tag begins with <, followed by one or more of a-zA-Z0-9 and -, then the attribute pattern repeated any number of times, and ends with >

const startTagReg = /^<([a-zA-Z0-9\-]+)(?:([ ]+[a-zA-Z0-9\-]+=[^> ]+))*>/;
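
The following sketch shows what the start tag pattern returns for a tag with two attributes. The sample input is made up for illustration.

```javascript
const startTagReg = /^<([a-zA-Z0-9\-]+)(?:([ ]+[a-zA-Z0-9\-]+=[^> ]+))*>/;

const match = '<div id="container" class="box1">text'.match(startTagReg);
console.log(match[0]); // '<div id="container" class="box1">'
console.log(match[1]); // 'div'

// A repeated capture group only keeps its last iteration, so the
// attributes have to be extracted one at a time with a separate regex.
console.log(match[2]); // ' class="box1"'
```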

Tokenizing

With these regular expressions in place we can tokenize. The first level handles < and text:

function parse(html, options) {
    function advance(num) {
        html = html.slice(num);
    }

    while(html){
        if(html.startsWith('<')) {
            //...
        } else {
            let textEndIndex = html.indexOf('<');
            // No '<' left: the rest of the input is text.
            textEndIndex = textEndIndex === -1 ? html.length : textEndIndex;
            options.onText({
                type: 'text',
                value: html.slice(0, textEndIndex)
            });
            advance(textEndIndex);
        }
    }
}
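
The text branch depends on how indexOf reports a missing <. As a minimal sketch of that edge case, the hypothetical helper splitText below separates the leading text from the rest of the input.

```javascript
// Hypothetical helper: split off the text that precedes the next '<'.
function splitText(html) {
    let end = html.indexOf('<');
    if (end === -1) end = html.length; // no tag left: everything is text
    return { text: html.slice(0, end), rest: html.slice(end) };
}

console.log(splitText('aaa<div>')); // { text: 'aaa', rest: '<div>' }
console.log(splitText('aaa'));      // { text: 'aaa', rest: '' }

// Without the -1 check, slice(0, -1) would silently drop the last character.
console.log('aaa'.slice(0, -1));    // 'aa'
```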

The second level handles <!--, <!doctype, end tags and start tags:

const commentMatch = html.match(commentReg);
if (commentMatch) {
    options.onComment({
        type: 'comment',
        value: commentMatch[0]
    })
    advance(commentMatch[0].length);
    continue;
}

const docTypeMatch = html.match(docTypeReg);
if (docTypeMatch) {
    options.onDoctype({
        type: 'docType',
        value: docTypeMatch[0]
    });
    advance(docTypeMatch[0].length);
    continue;
}

const endTagMatch = html.match(endTagReg);
if (endTagMatch) {
    options.onEndTag({
        type: 'tagEnd',
        value: endTagMatch[1]
    });
    advance(endTagMatch[0].length);
    continue;
}

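const startTagMatch = html.match(startTagReg);
if (startTagMatch) {
    options.onStartTag({
        type: 'tagStart',
        value: startTagMatch[1]
    });

    // Skip '<' plus the tag name, then read the attributes one by one.
    advance(startTagMatch[1].length + 1);
    let attributeMatch;
    while ((attributeMatch = html.match(attributeReg))) {
        options.onAttribute({
            type: 'attribute',
            value: attributeMatch[1]
        });
        advance(attributeMatch[0].length);
    }
    // Skip the closing '>'.
    advance(1);
    continue;
}

// Nothing matched: emit the stray '<' as text so the loop always makes progress.
options.onText({
    type: 'text',
    value: '<'
});
advance(1);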
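
The attribute loop can be tried in isolation. The sketch below assumes the attribute pattern stops each value at a space or >, as the start tag pattern does. The helper name readAttributes is hypothetical.

```javascript
// Assumption: attribute values end at the first space or '>'.
const attributeReg = /^(?:[ ]+([a-zA-Z0-9\-]+=[^> ]+))/;

// Hypothetical helper: collect the attributes that follow a tag name.
function readAttributes(input) {
    const attributes = [];
    let match;
    while ((match = input.match(attributeReg))) {
        attributes.push(match[1]);
        input = input.slice(match[0].length); // same idea as advance()
    }
    return { attributes, rest: input };
}

const result = readAttributes(' id="container" class="box1">text');
console.log(result.attributes); // [ 'id="container"', 'class="box1"' ]
console.log(result.rest);       // '>text'
```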

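After lexical analysis, we have all the tokens.

Syntax analysis

Once the input is split into tokens, we need to assemble them again. Only startTag, endTag and text nodes are handled. currentParent records the current tag.

On startTag, create an AST node, attach it to the children of currentParent, and make the new tag the currentParent
On endTag, set currentParent to the parent of the current tag
text is also attached to currentParent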

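function htmlParser(str) {
    const ast = {
        children: []
    };
    // Stack of enclosing parents; a single prevParent variable is not
    // enough once tags are nested more than one level deep.
    const parentStack = [];
    let curParent = ast;
    parse(str, {
        onComment(node) {
        },
        onStartTag(token) {
            const tag = {
                tagName: token.value,
                attributes: [],
                text: '',
                children: []
            };
            curParent.children.push(tag);
            parentStack.push(curParent);
            curParent = tag;
        },
        onAttribute(token) {
            // Split on the first '=' only, so values may contain '='.
            const index = token.value.indexOf('=');
            const name = token.value.slice(0, index);
            const value = token.value.slice(index + 1);
            curParent.attributes.push({
                name,
                value: value.replace(/^['"]/, '').replace(/['"]$/, '')
            });
        },
        onEndTag(token) {
            // Return to the enclosing tag; ignore an unmatched end tag at the root.
            if (parentStack.length > 0) {
                curParent = parentStack.pop();
            }
        },
        onDoctype(token) {
        },
        onText(token) {
            curParent.text = token.value;
        }
    });
    // The first child of the synthetic root is the document's root element.
    return ast.children[0];
}

module.exports = htmlParser;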

Let's try it out:

const htmlParser = require('./htmlParser');

const domTree = htmlParser(`
<!doctype html>
<body>
    <div>
        <!--button-->
        <button>按鈕</button>
        <div id="container">
            <div class="box1">
                <p>box1 box1 box1</p>
            </div>
            <div class="box2">
                <p>box2 box2 box2</p>
            </div>
        </div>
    </div>
</body>
`);

console.log(JSON.stringify(domTree, null, 4));

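The AST is generated successfully.

Summary

This article implements the HTML parser of the simple browser. Self-closing tags are not handled yet. That only takes one more if/else, which will be added later.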
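
One possible shape for that extra branch is sketched below. It is not the author's eventual implementation: the pattern selfClosingTagReg and the helper matchSelfClosingTag are assumptions made for illustration.

```javascript
// Assumed pattern: like the start tag, but ending in "/>" (optionally preceded by spaces).
const selfClosingTagReg = /^<([a-zA-Z0-9\-]+)(?:([ ]+[a-zA-Z0-9\-]+=[^> ]+))*[ ]*\/>/;

// Hypothetical helper: report whether the input starts with a self-closing tag.
function matchSelfClosingTag(html) {
    const match = html.match(selfClosingTagReg);
    return match ? { tagName: match[1], length: match[0].length } : null;
}

console.log(matchSelfClosingTag('<br/>text'));           // { tagName: 'br', length: 5 }
console.log(matchSelfClosingTag('<img src="a.png" />')); // { tagName: 'img', length: 19 }
console.log(matchSelfClosingTag('<div>'));               // null
```

In the AST step, such a tag would be pushed onto the current parent's children without becoming the current parent, since no end tag follows it.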

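We walked through the approach and implemented it: split the input into tokens with regular expressions, expose each token through a callback, and then assemble the AST while tracking the current parent so that the parent-child relationships come out right.

The HTML parser has also long been a staple interview question for Taobao front-end teams, and the Vue template compiler and JSX parsers use a similar approach, so it is worth mastering. I hope this article helps you sort out the approach.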
