Published: 2018-05-22 | Author: laosun | Views: 3714
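When you check which of a site's pages Baidu has indexed, or click a result on a Baidu search page, Baidu routes the click through an intermediate redirect link. You could call it a form of encryption, but its more important purpose is click statistics.
For example, take the address below:
Opening it actually lands on this site's home page. So when scraping with a crawler, how do we get the real address behind the redirect? Baidu's mechanism is simple: when the URL is requested, the intermediate redirect response carries the real URL in its Location header.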
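To make the mechanism concrete, here is a minimal sketch that reproduces it locally with only the JDK, with no jsoup and no call to Baidu. The class name `RedirectDemo`, the methods `resolveRedirect` and `startDemoServer`, and the `/link` path are hypothetical names chosen for illustration. The local server merely stands in for a jump link by answering with a 302 status and a Location header.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

class RedirectDemo {

    // Requests the link without following redirects and returns the
    // Location header of a 3xx response, or null if there is no redirect.
    static String resolveRedirect(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(link).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // stop at the first response
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            int status = conn.getResponseCode();
            return (status >= 300 && status < 400) ? conn.getHeaderField("Location") : null;
        } finally {
            conn.disconnect();
        }
    }

    // Hypothetical stand-in for a jump link: GET /link answers 302 and
    // points to the given target through the Location header.
    static HttpServer startDemoServer(String target) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/link", exchange -> {
            exchange.getResponseHeaders().add("Location", target);
            exchange.sendResponseHeaders(302, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startDemoServer("https://www.sunjs.com/");
        try {
            int port = server.getAddress().getPort();
            // Prints the redirect target without ever visiting it
            System.out.println(resolveRedirect("http://127.0.0.1:" + port + "/link"));
        } finally {
            server.stop(0);
        }
    }
}
```

The important detail is `setInstanceFollowRedirects(false)`: if the client followed the redirect automatically, the intermediate response and its Location header would never be visible to the caller.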
Below we test this with the jsoup library.
Here is the code:
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    // Baidu search results page for the query "site:www.sunjs.com"
    static String url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=site%3Awww.sunjs.com&oq=site%253Awww.sunjs.com&rsv_pq=dcfd424300025550&rsv_t=73eaxwIdrsFZsM%2F9CZpT9hbGRVlnirV%2FkDFuntgwz2ra43tSXYKtn4nyprk&rqlang=cn&rsv_enter=0";

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect(url).get();
            // Each search result is wrapped in an element with class "c-container"
            Elements listHtml = doc.select(".c-container");
            for (Element sign : listHtml) {
                Element link = sign.selectFirst("a");
                if (link == null) {
                    continue; // result block without a link
                }
                String href = link.attr("href");
                int itimeout = 60000;
                try {
                    // Request the jump link, but do not follow the redirect
                    Connection.Response res = Jsoup.connect(href).timeout(itimeout).method(Method.GET).followRedirects(false).execute();
                    // The real URL is in the Location header (null if the response is not a redirect)
                    String realUrl = res.header("Location");
                    System.out.println(realUrl);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The key part is just these two statements:
Connection.Response res = Jsoup.connect(href).timeout(itimeout).method(Method.GET).followRedirects(false).execute();
String realUrl = res.header("Location");
Output:
https://www.sunjs.com/
https://www.sunjs.com/article/detail/6b1aeaed4104476bbb8ba8babc1d314f.html
https://www.sunjs.com/article/detail/6ec78db2139a468d933c40ed38322ecf.html
https://www.sunjs.com/article/detail/c5ec29a15f2c45908b42a4f26d9d355d.html
https://www.sunjs.com/article/detail/42ffaaee8f9e40d3b10cb5f9033bcdde.html
https://www.sunjs.com/article/detail/990cf56a52b147c394a4b2d4df4d7278.html
https://www.sunjs.com/article/detail/cefba55bc616442eb936135e6574d021.html
https://www.sunjs.com/article/detail/1450bac401114ce8a51099f38a743eb6.html
https://www.sunjs.com/tag/search.action?tag=mysql
https://www.sunjs.com/article/search.action?keyword=mac
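Copyright: 技术客
Original article: https://www.sunjs.com/article/detail/cacbf3e4d80449fca7ea1fb15cdb11a1.html
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, please credit the source.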