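If you need to scrape data from web pages, Python is usually the easiest choice, since it has mature libraries for both HTTP requests and HTML parsing. That said, C# handles scraping perfectly well too and has good libraries of its own. Today I recommend one of them: HtmlAgilityPack. As usual, search for it in NuGet and install it into your project first. We'll use the cnblogs (博客园) home page as the example.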
1. Load the data
This library offers several ways to load data, both synchronous and asynchronous.
string xPath = null;
string urlPath = @"https://www.cnblogs.com/";
// 1. Load the page to scrape
HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument hDoc = hw.Load(urlPath);
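To make the "synchronous and asynchronous" remark concrete, here is a minimal sketch of three common ways to get an HtmlDocument. The PageLoader class and its method names are hypothetical wrappers introduced only for illustration; the calls inside them (HtmlWeb.Load, HtmlWeb.LoadFromWebAsync, HtmlDocument.LoadHtml) are the library's own.

```csharp
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
public static class PageLoader
{
    // Synchronous download and parse, the same call used in step 1.
    public static HtmlDocument FromUrl(string url)
    {
        return new HtmlWeb().Load(url);
    }

    // Asynchronous variant; await the returned task from an async method.
    public static Task<HtmlDocument> FromUrlAsync(string url)
    {
        return new HtmlWeb().LoadFromWebAsync(url);
    }

    // Parse markup that is already in memory (no network access needed).
    public static HtmlDocument FromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```

FromString is handy for trying out XPath expressions against a saved copy of the page, so you do not have to hit the site on every run.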
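2. Select the post nodes
// 2. Select the nodes that hold the data to scrape
xPath = "//article[@class='post-item']";
HtmlAgilityPack.HtmlNodeCollection hnc = hDoc.DocumentNode.SelectNodes(xPath);
// SelectNodes returns null (not an empty collection) when nothing matches
if (hnc == null) throw new System.InvalidOperationException("No post-item nodes found.");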
3. Loop over the posts
// 3. Loop over the posts and extract each field
List<CnblogsBillModel> listBlog = new List<CnblogsBillModel>();
foreach (HtmlAgilityPack.HtmlNode hn in hnc)
{
    // Title and link to the post
    xPath = ".//a[@class='post-item-title']";
    HtmlAgilityPack.HtmlNode hnTitle = hn.SelectSingleNode(xPath);
    string title = hnTitle.InnerText.Trim();
    string detailUrl = hnTitle.Attributes["href"].Value;
    // Summary
    xPath = ".//p[@class='post-item-summary']";
    HtmlAgilityPack.HtmlNode hnSummary = hn.SelectSingleNode(xPath);
    string summary = hnSummary.InnerText.Trim();
    // Post time
    xPath = ".//span[@class='post-meta-item']";
    HtmlAgilityPack.HtmlNode hnPostTime = hn.SelectSingleNode(xPath);
    string postTime = hnPostTime.InnerText.Trim();
    // Author name and home page
    xPath = ".//a[@class='post-item-author']";
    HtmlAgilityPack.HtmlNode hnAuthor = hn.SelectSingleNode(xPath);
    string author = hnAuthor.InnerText.Trim();
    string homePage = hnAuthor.Attributes["href"].Value;
    listBlog.Add(new CnblogsBillModel()
    {
        title = title,
        summary = summary,
        postTime = postTime,
        detailUrl = detailUrl,
        author = author,
        homePage = homePage,
    });
}
// Serialize the result to JSON (JsonHelper is a project helper that is not shown here)
string blogs = JsonHelper.GetJsonByObject(listBlog);
Finally, take a look at the scraped result.
The Model class used:
public class CnblogsBillModel
{
    public string title { get; set; }
    public string summary { get; set; }
    public string postTime { get; set; }
    public string detailUrl { get; set; }
    public string author { get; set; }
    public string homePage { get; set; }
}
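The loop in step 3 assumes that every node and attribute exists, so a single post with a different layout would throw a NullReferenceException. Below is a more defensive sketch of the same idea, run against an HTML string instead of the live site. PostInfo and PostParser are hypothetical names, and PostInfo is a trimmed-down stand-in for CnblogsBillModel. Decoding entities with HtmlEntity.DeEntitize is my own addition, not something the original code does.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical, trimmed-down stand-in for CnblogsBillModel.
public class PostInfo
{
    public string Title { get; set; }
    public string DetailUrl { get; set; }
    public string Summary { get; set; }
}

// Hypothetical helper, not part of HtmlAgilityPack.
public static class PostParser
{
    public static List<PostInfo> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var posts = new List<PostInfo>();
        HtmlNodeCollection articles =
            doc.DocumentNode.SelectNodes("//article[@class='post-item']");
        if (articles == null) return posts; // nothing matched: return an empty list

        foreach (HtmlNode article in articles)
        {
            HtmlNode titleNode = article.SelectSingleNode(".//a[@class='post-item-title']");
            if (titleNode == null) continue; // skip posts without a title link

            HtmlNode summaryNode = article.SelectSingleNode(".//p[@class='post-item-summary']");
            posts.Add(new PostInfo
            {
                // InnerText keeps entities such as &amp; so decode them explicitly
                Title = HtmlEntity.DeEntitize(titleNode.InnerText).Trim(),
                // GetAttributeValue returns the default when the attribute is missing
                DetailUrl = titleNode.GetAttributeValue("href", ""),
                Summary = summaryNode == null
                    ? ""
                    : HtmlEntity.DeEntitize(summaryNode.InnerText).Trim(),
            });
        }
        return posts;
    }
}
```

Calling PostParser.Parse(hDoc.DocumentNode.OuterHtml) on the document loaded in step 1 should give the same titles and links as the original loop, minus any posts that lack a title link.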
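There are two technical points here.
Point 1: selectors
SelectNodes(String) (selects the list of nodes that match the XPath expression; returns null when nothing matches)
SelectSingleNode(String) (selects the first HtmlNode that matches the XPath expression; returns null when nothing matches)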
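Since both selectors return null rather than an empty result when nothing matches, it can help to wrap them. The following is a minimal sketch; SelectorDemo and its two methods are hypothetical helpers introduced for illustration, not library APIs.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class SelectorDemo
{
    // Number of nodes matching the XPath; 0 when SelectNodes returns null.
    public static int CountMatches(HtmlNode root, string xpath)
    {
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Trimmed text of the first match, or null when there is no match.
    public static string FirstTextOrNull(HtmlNode root, string xpath)
    {
        HtmlNode node = root.SelectSingleNode(xpath);
        return node?.InnerText.Trim();
    }
}
```

For example, SelectorDemo.CountMatches(hDoc.DocumentNode, "//article[@class='post-item']") gives the number of posts on the page without risking a NullReferenceException.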
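Point 2: XPath
Did you notice that the XPath expressions inside the loop start with a dot (.)? The dot makes a big difference. When you query from the document root it changes nothing, but when you call the method on an element it does: an expression that starts with // is absolute and always searches the whole document, no matter which node you call it on, while one that starts with .// is relative and searches only the descendants of the current element. Without the dot, every iteration of the loop would return the first match on the page instead of the match inside the current article.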
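Here is a small sketch that shows the difference on a two-article snippet. The class name, method name, and sample markup are all made up for illustration.

```csharp
using HtmlAgilityPack;

// Hypothetical demo class, not part of HtmlAgilityPack.
public static class XPathScopeDemo
{
    // Made-up sample markup: two articles, each with one link.
    private const string Html =
        "<div><article id='a1'><a href='/first'>first</a></article>" +
        "<article id='a2'><a href='/second'>second</a></article></div>";

    // Runs the given XPath from the SECOND article and returns the link text found.
    public static string LinkTextFromSecondArticle(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode second = doc.DocumentNode.SelectSingleNode("//article[@id='a2']");
        return second.SelectSingleNode(xpath).InnerText;
    }
}
```

LinkTextFromSecondArticle(".//a") returns "second", because the search stays inside the second article. LinkTextFromSecondArticle("//a") returns "first", because the search restarts from the document root even though it was called on the second article.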


