Showing posts from September, 2013

Flatten HTML Document to List of Tags, Attributes, and Values

I needed to flatten a set of HTML documents into a list of the tags in their head sections, with one row per attribute name and value. I thought this bit of code might be useful to someone in the future.

This uses the CsQuery library, a C# port of jQuery: https://github.com/jamietre/CsQuery
CsQuery is also available as a NuGet package: https://www.nuget.org/packages/CsQuery
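
To make the flattening step easier to follow on its own, here is a minimal sketch that leaves CsQuery out and runs the same SelectMany pattern over plain in-memory data. The tag list, the class name FlattenSketch, and the sample attribute values are made up for illustration; only the shape of the query mirrors the CsQuery code in this post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FlattenSketch
{
    static void Main()
    {
        // Hypothetical stand-in for parsed <head> children:
        // a tag name plus its attributes as key/value pairs.
        var tags = new[]
        {
            new
            {
                Name = "META",
                Attributes = new Dictionary<string, string> { { "charset", "utf-8" } }
            },
            new
            {
                Name = "META",
                Attributes = new Dictionary<string, string>
                {
                    { "name", "description" },
                    { "content", "Example page" }
                }
            }
        };

        var rows = tags
            // Give each tag a unique ID so its attribute rows can be regrouped later.
            .Select(t => new { Tag = t, TagId = Guid.NewGuid() })
            // Emit one flat row per (tag, attribute) pair.
            .SelectMany(
                tagInfo => tagInfo.Tag.Attributes,
                (tagInfo, attribute) => new
                {
                    TagId = tagInfo.TagId,
                    TagType = tagInfo.Tag.Name,
                    AttributeName = attribute.Key,
                    AttributeValue = attribute.Value
                })
            .ToList();

        foreach (var row in rows)
        {
            Console.WriteLine("{0} {1} {2}={3}",
                row.TagId, row.TagType, row.AttributeName, row.AttributeValue);
        }
    }
}
```

The two sample tags carry three attributes between them, so the sketch produces three rows; the two rows that came from the second tag share the same TagId.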


// Requires: using System; using System.Linq;
// Note: Get HTML from somewhere...
var html = "";

var cq = CsQuery.CQ.Create(html);

var head = cq["head"];

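// Keep every direct child of <head> except <script> and <link>,
// and give each one a unique ID so its attributes can be grouped later.
var nonScriptHeadTagsQuery =
    from t in head.Children()
    where
        t.NodeName != "SCRIPT"
        && t.NodeName != "LINK"
    select new { Tag = t, TagId = Guid.NewGuid() };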

var nonScriptHeadTags = nonScriptHeadTagsQuery.ToList();

var htmlTags =
    nonScriptHeadTags
        .SelectMany(tagInfo => tagInfo.Tag.Attributes, (tagInfo, attribute) => new { TagInfo = tagInfo, Attribute = attribute })
        .Select(x => new
        {
            TagId = x.TagInfo.TagId,
            TagType = x.TagInfo.Tag.NodeName,
            AttributeName = x.Attribute.Key,
            AttributeValue = x.Attribute.Value,