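If you need to scrape data from web pages, Python is usually the easiest choice, since it has a large ecosystem of libraries for the job. That said, C# works fine for scraping too and has good libraries of its own; today I recommend HtmlAgilityPack. As usual, search for it in NuGet and install it into your project first. We'll use cnblogs (博客园) as the example site.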
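1. Load the data
The library offers several ways to load a document, both synchronous and asynchronous. Here we use the synchronous HtmlWeb.Load: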
string xPath = null;
string urlPath = @"https://www.cnblogs.com/";

// 1. Load the page to scrape
HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument hDoc = hw.Load(urlPath);
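As a rough sketch of the other loading options mentioned above, the following shows parsing an HTML string that is already in memory with HtmlDocument.LoadHtml, and downloading a page asynchronously with HtmlWeb.LoadFromWebAsync. The class name LoadExamples and its method names are made up for illustration, and the second call needs network access.

```csharp
using System;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class LoadExamples
{
    // Parse HTML that is already in memory; no network access is needed.
    public static HtmlDocument LoadFromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Download and parse a page without blocking the calling thread.
    public static async Task<HtmlDocument> LoadFromUrlAsync(string url)
    {
        var web = new HtmlWeb();
        return await web.LoadFromWebAsync(url);
    }

    public static async Task Main()
    {
        HtmlDocument local = LoadFromString("<html><body><h1>Hello</h1></body></html>");
        Console.WriteLine(local.DocumentNode.SelectSingleNode("//h1").InnerText);

        // Requires network access; the URL is the one used in this post.
        HtmlDocument remote = await LoadFromUrlAsync("https://www.cnblogs.com/");
        Console.WriteLine(remote.DocumentNode.OuterHtml.Length);
    }
}
```

LoadHtml is also handy for trying out XPath expressions against a saved copy of a page before pointing the code at the live site.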
2. Select the HTML nodes
// 2. Select the nodes to scrape: one <article> element per post
xPath = "//article[@class='post-item']";
HtmlAgilityPack.HtmlNodeCollection hnc = hDoc.DocumentNode.SelectNodes(xPath);
3. Loop over the nodes and extract the fields
// 3. Loop over the posts and extract each field
List<CnblogsBillModel> listBlog = new List<CnblogsBillModel>();

// SelectNodes returns null (not an empty collection) when nothing matches
if (hnc != null)
{
    foreach (HtmlAgilityPack.HtmlNode hn in hnc)
    {
        // Title and link to the post
        xPath = ".//a[@class='post-item-title']";
        HtmlAgilityPack.HtmlNode hnTitle = hn.SelectSingleNode(xPath);
        string title = hnTitle.InnerText.Trim();
        string detailUrl = hnTitle.Attributes["href"].Value;

        // Summary
        xPath = ".//p[@class='post-item-summary']";
        HtmlAgilityPack.HtmlNode hnSummary = hn.SelectSingleNode(xPath);
        string summary = hnSummary.InnerText.Trim();

        // Post time
        xPath = ".//span[@class='post-meta-item']";
        HtmlAgilityPack.HtmlNode hnPostTime = hn.SelectSingleNode(xPath);
        string postTime = hnPostTime.InnerText.Trim();

        // Author name and home page
        xPath = ".//a[@class='post-item-author']";
        HtmlAgilityPack.HtmlNode hnAuthor = hn.SelectSingleNode(xPath);
        string author = hnAuthor.InnerText.Trim();
        string homePage = hnAuthor.Attributes["href"].Value;

        listBlog.Add(new CnblogsBillModel()
        {
            title = title,
            summary = summary,
            postTime = postTime,
            detailUrl = detailUrl,
            author = author,
            homePage = homePage,
        });
    }
}

// Serialize the result to JSON (JsonHelper is not part of HtmlAgilityPack)
string blogs = JsonHelper.GetJsonByObject(listBlog);
Finally, take a look at the result: the blogs string holds the scraped posts serialized as JSON.
The model class used:
public class CnblogsBillModel
{
    public string title { get; set; }
    public string summary { get; set; }
    public string postTime { get; set; }
    public string detailUrl { get; set; }
    public string author { get; set; }
    public string homePage { get; set; }
}
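The JsonHelper class used in step 3 is not shown in this post. As a stand-in, here is a minimal sketch that serializes the same model with System.Text.Json; the class JsonSketch and the method ToJson are hypothetical names, and this is only an assumption about what such a helper might do, not the helper actually used above. The model is repeated so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

// Same model as in the post, repeated so this snippet is self-contained.
public class CnblogsBillModel
{
    public string title { get; set; }
    public string summary { get; set; }
    public string postTime { get; set; }
    public string detailUrl { get; set; }
    public string author { get; set; }
    public string homePage { get; set; }
}

// Hypothetical replacement for the JsonHelper used in the post.
public static class JsonSketch
{
    public static string ToJson(List<CnblogsBillModel> posts)
    {
        var options = new JsonSerializerOptions
        {
            WriteIndented = true,
            // Keep Chinese text readable instead of emitting \uXXXX escapes.
            Encoder = JavaScriptEncoder.Create(UnicodeRanges.All)
        };
        return JsonSerializer.Serialize(posts, options);
    }

    public static void Main()
    {
        var posts = new List<CnblogsBillModel>
        {
            new CnblogsBillModel { title = "示例标题", author = "someone" }
        };
        Console.WriteLine(ToJson(posts));
    }
}
```

Properties that were never set (summary, postTime and so on in this example) are written as null by default.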
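There are two technical points worth noting.
Point 1: selectors
SelectNodes(String) selects all nodes that match the XPath expression and returns them as an HtmlNodeCollection; it returns null when nothing matches.
SelectSingleNode(String) selects the first HtmlNode that matches the XPath expression, or null when nothing matches.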
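Because both selectors return null rather than an empty result when nothing matches, it is worth guarding against that case. The sketch below shows one way to do it; SafeSelect and Texts are hypothetical names introduced for illustration, and the HTML is a made-up fragment.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class SafeSelect
{
    // Returns the trimmed text of every node matching xpath,
    // or an empty list when nothing matches.
    public static List<string> Texts(HtmlNode root, string xpath)
    {
        var result = new List<string>();
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        if (nodes == null)
        {
            return result; // no match: SelectNodes gave us null
        }
        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span> a </span><span>b</span></div>");

        Console.WriteLine(Texts(doc.DocumentNode, "//span").Count);  // 2
        Console.WriteLine(Texts(doc.DocumentNode, "//table").Count); // 0

        // SelectSingleNode also returns null when nothing matches.
        HtmlNode missing = doc.DocumentNode.SelectSingleNode("//table");
        Console.WriteLine(missing?.InnerText ?? "(no match)");       // (no match)
    }
}
```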
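Point 2: XPath
Did you notice that the XPath expressions inside the loop start with a dot (.)? The dot makes a big difference. When you query from the document root, //a and .//a return the same result, but when you call the method on a specific element they do not: with the dot, the search starts at the current element and only looks at its descendants; without it, the search starts at the root of the whole page, no matter which element you called it on.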
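The difference is easy to see on a small made-up fragment. In this sketch (XPathDotDemo and TitleOf are hypothetical names, and the HTML only imitates the structure of a post list), the same query is run on the second post with and without the leading dot:

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class, for illustration only.
public static class XPathDotDemo
{
    // Runs xpath on the given post node and returns the matched link text.
    public static string TitleOf(HtmlNode post, string xpath)
    {
        return post.SelectSingleNode(xpath).InnerText;
    }

    public static HtmlNode SecondPost()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='post'><a class='t' href='/p/1'>First</a></div>" +
            "<div class='post'><a class='t' href='/p/2'>Second</a></div>");
        HtmlNodeCollection posts = doc.DocumentNode.SelectNodes("//div[@class='post']");
        return posts[1];
    }

    public static void Main()
    {
        HtmlNode second = SecondPost();

        // With the dot: search only inside the second post.
        Console.WriteLine(TitleOf(second, ".//a[@class='t']")); // Second

        // Without the dot: search the whole document, so the first link wins.
        Console.WriteLine(TitleOf(second, "//a[@class='t']"));  // First
    }
}
```

Without the dot, every iteration of the loop in step 3 would pick up the first post on the page, so the resulting list would contain the same title over and over.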