Precisely Extracting a Web Page's Publication Time in Java

This guide walks through precisely extracting a web page's publication time in Java. It covers the following steps:

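1. Fetch the page's HTML source

Use a networking library such as HttpClient or Jsoup to send a request to the target page and read the HTML text it returns.

Example 1: fetching the HTML source with HttpClient: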

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HtmlSourceExtractor {

    public static String getHtml(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        // try-with-resources closes both the client and the response
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            HttpEntity entity = response.getEntity();
            // UTF-8 is the fallback used when the response declares no charset
            return EntityUtils.toString(entity, "UTF-8");
        }
    }
}

Example 2: fetching the HTML source with Jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlSourceExtractor {

    public static String getHtml(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        String htmlContent = doc.html();
        return htmlContent;
    }
}

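2. Match the publication time with a regular expression

Once you have the HTML text, choose a regular expression that matches the publication time. Common time formats include yyyy-MM-dd HH:mm:ss, yyyy/MM/dd HH:mm:ss, and yyyy年MM月dd日 HH:mm:ss.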
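As a rough illustration of matching the three formats listed above, the sketch below scans arbitrary text for the first timestamp in any of them. The class name `CommonTimeFinder` and its `findFirst` method are hypothetical helpers, not part of the original walkthrough, and the pattern deliberately ignores other formats (for example, timestamps without seconds).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class CommonTimeFinder {

    // Date part: yyyy-MM-dd, yyyy/MM/dd, or yyyy年MM月dd日 (one- or two-digit
    // month and day), followed by whitespace and a time part HH:mm:ss.
    private static final Pattern TIMESTAMP = Pattern.compile(
            "\\d{4}(?:-\\d{1,2}-\\d{1,2}|/\\d{1,2}/\\d{1,2}|年\\d{1,2}月\\d{1,2}日)"
            + "\\s+\\d{1,2}:\\d{2}:\\d{2}");

    // Returns the first timestamp found in the text, or empty if there is none.
    public static Optional<String> findFirst(String text) {
        Matcher matcher = TIMESTAMP.matcher(text);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }
}
```

A generic pattern like this can match any timestamp on the page (comments, footers, related articles), so in practice it works best when applied to a fragment of the HTML that is already known to hold the publication time.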

Example 1: extracting the publication time of a JD.com product page with a regular expression:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeExtractor {

    private static final String JD_TIME_REGEX = "itemprop=\"datePublished\" content=\"(.*?)\"";

    public static String extractTimeFromJdHtml(String htmlContent) {
        Pattern pattern = Pattern.compile(JD_TIME_REGEX);
        Matcher matcher = pattern.matcher(htmlContent);
        if (matcher.find()) {
            return matcher.group(1);
        }
        return null;
    }
}
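The regular expression in the example above only matches when `itemprop` comes immediately before `content` and both use double quotes. The following sketch is one possible way to relax that. It assumes the value lives on a `<meta>` tag, which the original example does not state, and `MetaDateExtractor` is a hypothetical name introduced here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; assumes the date sits on a <meta> tag.
final class MetaDateExtractor {

    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ITEMPROP = Pattern.compile(
            "\\bitemprop\\s*=\\s*[\"']datePublished[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT = Pattern.compile(
            "\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Checks each <meta> tag on its own, so the attribute order inside the tag does not matter.
    public static Optional<String> extract(String html) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String tagText = tag.group();
            if (ITEMPROP.matcher(tagText).find()) {
                Matcher content = CONTENT.matcher(tagText);
                if (content.find()) {
                    return Optional.of(content.group(1));
                }
            }
        }
        return Optional.empty();
    }
}
```

Matching the tag first and the attributes second keeps each pattern short and avoids one large expression that has to enumerate every attribute order.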

Example 2: extracting the publication time of a Zhihu question with a regular expression:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeExtractor {

    private static final String ZHIHU_TIME_REGEX = "<span class=\"MetaItem\">\n"
            + "\\s*(.+?)\n"
            + "\\s*</span>";

    public static String extractTimeFromZhihuHtml(String htmlContent) {
        Pattern pattern = Pattern.compile(ZHIHU_TIME_REGEX, Pattern.DOTALL);
        Matcher matcher = pattern.matcher(htmlContent);
        if (matcher.find()) {
            String timeStr = matcher.group(1);
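            // Optional whitespace tolerates both "2021年7月15日" and "2021 年 7 月 15 日"
            return timeStr.replaceAll("\\s*年\\s*", "-")
                    .replaceAll("\\s*月\\s*", "-")
                    .replaceAll("\\s*日\\s*", " ")
                    // Caution: dropping 上午/下午 discards the AM/PM distinction
                    .replaceAll("\\s*(?:上午|下午)\\s*", " ")
                    .trim();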
        }
        return null;
    }
}
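The Zhihu example above removes the 上午/下午 (AM/PM) markers, which would turn an afternoon time such as 下午 3:05 into 3:05. The sketch below shows one way to keep that information by converting to a 24-hour clock. The input shape (`yyyy年M月d日`, an optional marker, then `H:mm`) is an assumption made for illustration, not something confirmed about Zhihu's markup, and `ChineseMeridiemNormalizer` is a hypothetical name.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; the input format is assumed.
final class ChineseMeridiemNormalizer {

    // Groups: 1 year, 2 month, 3 day, 4 optional 上午/下午, 5 hour, 6 minute.
    private static final Pattern CN_TIME = Pattern.compile(
            "(\\d{4})\\s*年\\s*(\\d{1,2})\\s*月\\s*(\\d{1,2})\\s*日"
            + "\\s*(上午|下午)?\\s*(\\d{1,2}):(\\d{2})");

    // Returns the time as "yyyy-MM-dd HH:mm:00" on a 24-hour clock.
    public static Optional<String> normalize(String raw) {
        Matcher m = CN_TIME.matcher(raw);
        if (!m.find()) {
            return Optional.empty();
        }
        int hour = Integer.parseInt(m.group(5));
        String meridiem = m.group(4); // null when no marker is present
        if ("下午".equals(meridiem) && hour < 12) {
            hour += 12;               // 下午 3:05 becomes 15:05
        } else if ("上午".equals(meridiem) && hour == 12) {
            hour = 0;                 // 上午 12:30 becomes 00:30
        }
        return Optional.of(String.format(Locale.ROOT, "%s-%02d-%02d %02d:%s:00",
                m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                hour,
                m.group(6)));
    }
}
```

The seconds are filled in as `00` so the result fits the yyyy-MM-dd HH:mm:ss pattern used in the conversion step.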

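3. Convert the time to a standard format

Convert the matched time string into a standard date-time value, for example by parsing it with Java's SimpleDateFormat class.

Example 1: converting the JD.com product publication time to a standard date-time: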

import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeFormatConverter {

    private static final String JD_TIME_PATTERN = "yyyy-MM-dd HH:mm:ss";

    public static Date convertJdTimeToStandardFormat(String jdTimeStr) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat(JD_TIME_PATTERN);
        return sdf.parse(jdTimeStr);
    }
}

Example 2: converting the Zhihu question publication time to a standard date-time:

import java.text.SimpleDateFormat;
import java.util.Date;

public class TimeFormatConverter {

    private static final String ZHIHU_TIME_PATTERN = "yyyy-MM-dd HH:mm:ss";

    public static Date convertZhihuTimeToStandardFormat(String zhihuTimeStr) throws Exception {
        SimpleDateFormat sdf = new SimpleDateFormat(ZHIHU_TIME_PATTERN);
        return sdf.parse(zhihuTimeStr);
    }
}
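The two converters above each handle a single pattern with SimpleDateFormat. As an alternative sketch, the code below uses the `java.time` API (a swapped-in technique, not the one this guide uses) and tries the three formats mentioned in step 2 in turn. `FlexibleTimeParser` is a hypothetical name, and the result is a `LocalDateTime` with no time zone attached.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only.
final class FlexibleTimeParser {

    // Candidate patterns, tried in order; DateTimeFormatter instances are thread-safe.
    private static final List<DateTimeFormatter> FORMATTERS = List.of(
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss"),
            DateTimeFormatter.ofPattern("yyyy'年'MM'月'dd'日' HH:mm:ss"));

    // Returns the parsed value from the first pattern that fits, or empty if none does.
    public static Optional<LocalDateTime> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String text = raw.trim();
        for (DateTimeFormatter formatter : FORMATTERS) {
            try {
                return Optional.of(LocalDateTime.parse(text, formatter));
            } catch (DateTimeParseException e) {
                // This pattern does not fit; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

Returning an `Optional` instead of throwing makes the "no recognizable time" case explicit for callers, which is common when scraping pages whose markup changes.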

4. Complete code

Putting the steps together, the complete Java code looks like this:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TimeExtractor {

    public static void main(String[] args) throws Exception {
        String jdUrl = "https://item.jd.com/100011288958.html";
        // getHtml is the Jsoup-based method defined at the bottom of this class
        String jdHtmlContent = getHtml(jdUrl);
        String jdTimeStr = extractTimeFromJdHtml(jdHtmlContent);
        System.out.println("[JD.com] Publication time: " + jdTimeStr);
        Date jdTime = convertJdTimeToStandardFormat(jdTimeStr);
        System.out.println("[JD.com] Parsed time: " + jdTime);

        String zhihuUrl = "https://www.zhihu.com/question/471734788/answer/2009389214";
        String zhihuHtmlContent = getHtml(zhihuUrl);
        String zhihuTimeStr = extractTimeFromZhihuHtml(zhihuHtmlContent);
        System.out.println("[Zhihu] Publication time: " + zhihuTimeStr);
        Date zhihuTime = convertZhihuTimeToStandardFormat(zhihuTimeStr);
        System.out.println("[Zhihu] Parsed time: " + zhihuTime);
    }

    public static String extractTimeFromJdHtml(String htmlContent) {
        String jdTimeRegex = "itemprop=\"datePublished\" content=\"(.*?)\"";
        Pattern pattern = Pattern.compile(jdTimeRegex);
        Matcher matcher = pattern.matcher(htmlContent);
        if (matcher.find()) {
            return matcher.group(1);
        }
        return null;
    }

    public static String extractTimeFromZhihuHtml(String htmlContent) {
        String zhihuTimeRegex = "<span class=\"MetaItem\">\n"
                + "\\s*(.+?)\n"
                + "\\s*</span>";
        Pattern pattern = Pattern.compile(zhihuTimeRegex, Pattern.DOTALL);
        Matcher matcher = pattern.matcher(htmlContent);
        if (matcher.find()) {
            String timeStr = matcher.group(1);
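            // Optional whitespace tolerates both "2021年7月15日" and "2021 年 7 月 15 日"
            return timeStr.replaceAll("\\s*年\\s*", "-")
                    .replaceAll("\\s*月\\s*", "-")
                    .replaceAll("\\s*日\\s*", " ")
                    // Caution: dropping 上午/下午 discards the AM/PM distinction
                    .replaceAll("\\s*(?:上午|下午)\\s*", " ")
                    .trim();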
        }
        return null;
    }

    public static Date convertJdTimeToStandardFormat(String jdTimeStr) throws Exception {
        if (StringUtils.isBlank(jdTimeStr)) {
            return null;
        }
        String jdTimePattern = "yyyy-MM-dd HH:mm:ss";
        SimpleDateFormat sdf = new SimpleDateFormat(jdTimePattern);
        return sdf.parse(jdTimeStr);
    }

    public static Date convertZhihuTimeToStandardFormat(String zhihuTimeStr) throws Exception {
        if (StringUtils.isBlank(zhihuTimeStr)) {
            return null;
        }
        String zhihuTimePattern = "yyyy-MM-dd HH:mm:ss";
        SimpleDateFormat sdf = new SimpleDateFormat(zhihuTimePattern);
        return sdf.parse(zhihuTimeStr);
    }

    public static String getHtml(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        return doc.html();
    }
}

That completes the guide to precisely extracting a web page's publication time in Java.
