Merge pull request #154 from lonekorean/v3

v3
This commit is contained in:
Will Boyd
2025-03-18 14:54:39 -04:00
committed by GitHub
30 changed files with 1550 additions and 2112 deletions
-17
View File
@@ -1,17 +0,0 @@
{
"env": {
"commonjs": true,
"es6": true,
"node": true
},
"extends": "eslint:recommended",
"globals": {
"Atomics": "readonly",
"SharedArrayBuffer": "readonly"
},
"parserOptions": {
"ecmaVersion": 2018
},
"rules": {
}
}
+11
View File
@@ -0,0 +1,11 @@
# How to Contribute
Contributions are welcome! Thank you!
## General Guidelines
Some quick notes when making a pull request.
- Match the style and formatting of the code you are editing.
- Each pull request should be focused on a single thing (a single bug fix, a single feature, etc.). This makes reviewing easier and minimizes merge conflicts.
- Include a description of the problem being solved and what your code does. Steps to reproduce the problem or example input/output are very helpful.
-23
View File
@@ -1,23 +0,0 @@
# How to Contribute
Contributions are welcome! Thank you!
## General Guidelines
Some quick notes when making a pull request.
- Match the style and formatting of the code you are editing.
- Each pull request should be focused on a single thing (a single bug fix, a single feature, etc.). This makes reviewing easier and minimizes merge conflicts.
- Include a description of the problem being solved and what your code does. Steps to reproduce the problem or example input/output are very helpful.
## Adding Options
Keeping the wizard as short as possible is a priority. Pull requests that add options to the wizard will probably not be accepted. Instead, you can add an advanced setting to [settings.js](https://github.com/lonekorean/wordpress-export-to-markdown/blob/master/src/settings.js).
## Adding Frontmatter Fields
Similarly, default frontmatter output is limited to just a few widely used fields to avoid bloat. However, you may add new optional frontmatter fields.
To do so, follow the instructions in [/src/frontmatter/example.js](https://github.com/lonekorean/wordpress-export-to-markdown/blob/master/src/frontmatter/example.js).
Users will be able to include your new frontmatter field by editing `frontmatter_fields` in [settings.js](https://github.com/lonekorean/wordpress-export-to-markdown/blob/master/src/settings.js).
+181 -84
View File
@@ -1,150 +1,247 @@
# wordpress-export-to-markdown
Converts a WordPress export file into Markdown files that are compatible with static site generators ([Eleventy](https://www.11ty.dev/), [Gatsby](https://www.gatsbyjs.com/), [Hugo](https://gohugo.io/), etc.).
Converts a WordPress export XML file into Markdown files. This makes it easy to migrate from WordPress to a static site generator ([Eleventy](https://www.11ty.dev/), [Gatsby](https://www.gatsbyjs.com/), [Hugo](https://gohugo.io/), etc.).
Each post is saved as a separate Markdown file with frontmatter. Images are downloaded and saved.
![wordpress-export-to-markdown running in a terminal](https://github.com/user-attachments/assets/7ac1aa07-b6ee-46f4-ab49-291c1c45f350)
![wordpress-export-to-markdown running in a terminal](https://user-images.githubusercontent.com/1245573/72686026-3aa04280-3abe-11ea-92c1-d756a24657dd.gif)
## Features
- Saves each post as a separate Markdown file with frontmatter.
- Also saves drafts, pages, and custom post types, if you have any.
- Downloads images and updates references to them.
- User-friendly wizard guides you through the process.
- Lots of command line options for configuration, if needed.
## Quick Start
You'll need:
- [Node.js](https://nodejs.org/) installed
- Your [WordPress export file](https://wordpress.org/support/article/tools-export-screen/) (be sure to export "All content").
To make things easier, you can rename your WordPress export file to `export.xml` and drop it into the same directory that you run this script from.
- [Node.js](https://nodejs.org/) installed.
- Your [WordPress export file](https://wordpress.org/support/article/tools-export-screen/). Be sure to export "All Content".
You can run this script immediately in your terminal with `npx`:
Then run this in your terminal:
```
npx wordpress-export-to-markdown
```
Or you can clone this repo, then from within the repo's directory, install and run:
## Options
```
npm install && node index.js
```
The script will start with a wizard to ask you a few questions.
Either way, the script will start a wizard to configure your options. Answer the questions and off you go!
## Command Line
Options can also be configured via the command line. The wizard will skip asking about any such options. For example, the following will give you [Jekyll](https://jekyllrb.com/)-style output in terms of folder structure and filenames.
Using `npx`:
Optionally, you can provide answers to any of these questions via command line arguments, in which case the wizard will skip asking those questions. Here's an example:
```
npx wordpress-export-to-markdown --post-folders=false --prefix-date=true
```
Using a locally cloned repo:
```
node index.js --post-folders=false --prefix-date=true
```
The wizard will still ask you about any options not specified on the command line. To skip the wizard entirely and use default values for unspecified options, add `--wizard=false`.
## Options
These are the questions asked by the wizard. Command line arguments, along with their default values, are also being provided here if you want to use them.
The questions are given below, including a snippet for each one showing its command line argument set to its default value.
### Path to WordPress export file?
**Command line:** `--input=export.xml`
```
--input=export.xml
```
The path to your WordPress export file. To make things easier, you can rename your WordPress export file to `export.xml` and drop it into the same directory that you run this script from.
The path to your [WordPress export file](https://wordpress.org/documentation/article/tools-export-screen/). To make things easier, you can rename it to `export.xml` and drop it into the same directory that you run the script from.
Allowed values:
- Any path to a file that exists.
### Put each post into its own folder?
```
--post-folders=true
```
Whether or not to create a separate folder for each post's Markdown file (and images).
Allowed values:
- `true` - A folder is created for each post, with an `index.md` file and `/images` folder within. The post slug is used to name the folder.
- `false` - The post slug is used to name each post's Markdown file. These files are all saved in the same folder. All images are saved in a shared `/images` folder.
### Add date prefix to posts?
```
--prefix-date=false
```
Whether or not to prepend the post date when naming a post's folder or file.
Allowed values:
- `true` - Prepend the date, in the format `<year>-<month>-<day>`. Nothing will be prepended if there is no date (for example, an undated draft post).
- `false` - Don't prepend the date.
### Organize posts into date folders?
```
--date-folders=none
```
If and how output is organized into folders based on date.
Allowed values:
- `year` - Output is organized into folders by year. This won't happen for posts with no date (for example, an undated draft post).
- `yearmonth` - Output is organized into folders by year, then into nested folders by month. Again, for posts with no date, this won't happen.
- `none` - No date folders are created.
### Save images?
```
--save-images=all
```
Which images you want to download and save.
Allowed values:
- `attached` - Save images attached to posts. Generally speaking, these are images that were uploaded by using **Add Media** or **Set Featured Image** in WordPress.
- `scraped` - Save images scraped from `<img>` tags in post body content. The `<img>` tags are updated to point to where the images are saved.
- `all` - Save all images, essentially the results of `attached` and `scraped` combined.
- `none` - Don't save any images.
## Advanced Options
These are not included in the wizard, so you'll need to set them on the command line.
### Use wizard?
```
--wizard=true
```
Whether or not to use the wizard.
Allowed values:
- `true` - The script will start with a wizard to ask five questions (the ones from the [Options](#options) section) minus any that were answered on the command line.
- `false` - Skip wizard. Options set via command line are taken, while the rest have their default values used.
### Path to output folder?
**Command line:** `--output=output`
```
--output=output
```
The path to the output directory where Markdown and image files will be saved. If it does not exist, it will be created.
The path to the output folder where files will be saved. It'll be created if it doesn't exist. Existing files there won't be overwritten and won't be downloaded again. This lets you resume progress by restarting the script, if it was previously terminated early. To start clean, delete the output folder.
### Create year folders?
Allowed values:
**Command line:** `--year-folders=false`
- Any valid folder path.
Whether or not to organize output files into folders by year.
### Frontmatter fields?
### Create month folders?
```
--frontmatter-fields=title,date,categories,tags,coverImage,draft
```
**Command line:** `--month-folders=false`
Comma separated list of the frontmatter fields to include in Markdown files. Order is preserved. If a post doesn't have a value for a field, it is left off.
Whether or not to organize output files into folders by month. You'll probably want to combine this with `--year-folders` to organize files by year then month.
Allowed values:
### Create a folder for each post?
- A comma separated list with any of the following: `author`, `categories`, `coverImage`, `date`, `draft`, `excerpt`, `id`, `slug`, `tags`, `title`, `type`. You can rename a field by appending `:` and the alias to use. For example, `date:created` will rename `date` to `created`.
**Command line:** `--post-folders=true`
### Delay between image file requests?
Whether or not to save files and images into post folders.
```
--request-delay=500
```
If `true`, the post slug is used for the folder name and the post's Markdown file is named `index.md`. Each post folder will have its own `/images` folder.
Time (in milliseconds) to wait between requesting image files. Increasing this might help if you see timeouts or server errors.
/first-post
/images
potato.png
index.md
/second-post
/images
carrot.jpg
celery.jpg
index.md
Allowed values:
If `false`, the post slug is used to name the post's Markdown file. These files will be side-by-side and images will go into a shared `/images` folder.
- Any positive integer.
/images
carrot.jpg
celery.jpg
potato.png
first-post.md
second-post.md
### Delay between writing markdown files?
Either way, this can be combined with with `--year-folders` and `--month-folders`, in which case the above output will be organized under the appropriate year and month folders.
```
--write-delay=10
```
### Prefix post folders/files with date?
Time (in milliseconds) to wait between saving Markdown files. Increasing this might help if your file system becomes overloaded.
**Command line:** `--prefix-date=false`
Allowed values:
Whether or not to prepend the post date to the post slug when naming a post's folder or file.
- Any positive integer.
If `--post-folders` is `true`, this affects the folder.
### Timezone to apply to date?
/2019-10-14-first-post
index.md
/2019-10-23-second-post
index.md
```
--timezone=utc
```
If `--post-folders` is `false`, this affects the file.
The timezone applied to post dates.
2019-10-14-first-post.md
2019-10-23-second-post.md
Allowed values:
### Save images attached to posts?
- Any valid timezone as [specified here](https://moment.github.io/luxon/#/zones?id=specifying-a-zone).
**Command line:** `--save-attached-images=true`
### Include time with frontmatter date?
Whether or not to download and save images attached to posts. Generally speaking, these are images that were uploaded by using **Add Media** or **Set Featured Image** in WordPress. Images are saved into `/images`.
```
--include-time=false
```
### Save images scraped from post body content?
Whether or not time should be included with the date in frontmatter.
**Command line:** `--save-scraped-images=true`
Allowed values:
Whether or not to download and save images scraped from `<img>` tags in post body content. Images are saved into `/images`. The `<img>` tags are updated to point to where the images are saved.
- `true` - Time is included using an ISO 8601-compliant format. For example, `2020-12-25T11:20:35.000Z`.
- `false` - Time is not included. For example, `2020-12-25`.
### Include custom post types and pages?
### Frontmatter date format string?
**Command line:** `--include-other-types=false`
```
--date-format=""
```
Some WordPress sites make use of a `"page"` post type and/or custom post types. Set this to `true` to include these post types in the output. Posts will be organized into post type folders.
A custom formatting string to apply to frontmatter dates. If set, takes precedence over `--include-time`. An empty string (the default) is ignored, resulting in the basic `<year>-<month>-<day>` format.
## Customizing Frontmatter and Other Advanced Settings
Allowed values:
You can edit [settings.js](https://github.com/lonekorean/wordpress-export-to-markdown/blob/master/src/settings.js) to configure advanced settings beyond the options above. This includes things like customizing frontmatter, date formatting, throttling image downloads, and more.
- Any valid custom formatting string. See [this table of tokens](https://moment.github.io/luxon/#/parsing?id=table-of-tokens).
You'll need to run the script locally (not using `npx`) to edit these advanced settings.
### Wrap frontmatter date in quotes?
```
--quote-date=false
```
Whether or not to put double quotes around the date when writing it to frontmatter.
Allowed values:
- `true` - Adds double quotes. This technically turns the date into a string value.
- `false` - Doesn't add double quotes.
### Use strict SSL?
```
--strict-ssl=true
```
Whether or not to use strict SSL when downloading images.
Allowed values:
- `true` - Use strict SSL. This is the safer option.
- `false` - Don't use strict SSL. This will let you avoid the "self-signed certificate" error when working with a self-signed server. Just make sure you know what you're doing.
## Local Development
You can install and run this script locally if you want to tinker with it:
1. `git clone` this repo.
2. `cd` into the repo directory.
3. Run `npm install`.
Now instead of running `npx wordpress-export-to-markdown` you can run `node app`. They both take all the same command line arguments in the same way.
## Contributing
Please read the [contribution guidelines](https://github.com/lonekorean/wordpress-export-to-markdown/blob/master/CONTRIBUTING.md).
Please read the [contribution guidelines](https://github.com/lonekorean/wordpress-export-to-markdown/blob/master/.github/CONTRIBUTING.md).
+38
View File
@@ -0,0 +1,38 @@
#!/usr/bin/env node
import chalk from 'chalk';
import * as commander from 'commander';
import path from 'path';
import * as intake from './src/intake.js';
import * as parser from './src/parser.js';
import * as shared from './src/shared.js';
import * as writer from './src/writer.js';
(async () => {
// configure command line help output
commander.program
.name('npx wordpress-export-to-markdown')
.helpOption('-h, --help', 'See the thing you\'re looking at right now')
.addHelpText('after', '\nMore documentation is at https://github.com/lonekorean/wordpress-export-to-markdown')
.configureHelp({
styleOptionTerm: (str) => str.replace(/(<.*>)$/, chalk.gray('$1')),
styleOptionDescription: (str) => str.replace(/(\(.*\))$/, chalk.gray('$1'))
});
// gather config options from command line and wizard
await intake.getConfig();
// parse data from XML and do Markdown translations
const posts = await parser.parseFilePromise()
// write files and download images
await writer.writeFilesPromise(posts);
// happy goodbye
console.log('\nAll done!');
console.log('Look for your output files in: ' + path.resolve(shared.config.output));
})().catch((ex) => {
// sad goodbye
console.log('\nSomething went wrong, execution halted early.');
console.error(ex);
});
-27
View File
@@ -1,27 +0,0 @@
#!/usr/bin/env node
const path = require('path');
const process = require('process');
const wizard = require('./src/wizard');
const parser = require('./src/parser');
const writer = require('./src/writer');
(async () => {
// parse any command line arguments and run wizard
const config = await wizard.getConfig(process.argv);
// parse data from XML and do Markdown translations
const posts = await parser.parseFilePromise(config)
// write files, downloading images as needed
await writer.writeFilesPromise(posts, config);
// happy goodbye
console.log('\nAll done!');
console.log('Look for your output files in: ' + path.resolve(config.output));
})().catch(ex => {
// sad goodbye
console.log('\nSomething went wrong, execution halted early.');
console.error(ex);
});
+438 -1376
View File
File diff suppressed because it is too large Load Diff
+9 -16
View File
@@ -2,7 +2,7 @@
"name": "wordpress-export-to-markdown",
"version": "2.4.2",
"description": "Converts a WordPress export XML file into Markdown files.",
"main": "index.js",
"main": "app.js",
"repository": "https://github.com/lonekorean/wordpress-export-to-markdown.git",
"keywords": [
"blog",
@@ -11,30 +11,23 @@
"markdown",
"wordpress"
],
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"author": "Will Boyd <will@codersblock.com> (https://codersblock.com)",
"license": "MIT",
"engines": {
"node": ">= 18.0.0"
"node": ">= 20.5.0"
},
"type": "module",
"dependencies": {
"axios": "^1.7.9",
"camelcase": "^6.3.0",
"chalk": "^4.1.2",
"commander": "^13.0.0",
"inquirer": "^8.2.6",
"@guyplusplus/turndown-plugin-gfm": "^1.0.7",
"@inquirer/prompts": "^7.4.0",
"axios": "^1.8.2",
"chalk": "^5.4.1",
"commander": "^13.1.0",
"luxon": "^3.5.0",
"require-directory": "^2.1.1",
"turndown": "^7.2.0",
"turndown-plugin-gfm": "^1.0.2",
"xml2js": "^0.6.2"
},
"devDependencies": {
"eslint": "^8.57.1"
},
"bin": {
"wordpress-export-to-markdown": "./index.js"
"wordpress-export-to-markdown": "./app.js"
}
}
+108
View File
@@ -0,0 +1,108 @@
import xml2js from 'xml2js';
class Data {
#obj;
#expression;
constructor(obj, expression) {
// xml2js returns leaf nodes as strings, turn those into consistent objects
// I found this to be safer and more efficient than using the explicitCharkey option
this.#obj = typeof obj === 'string' ? { _: obj } : obj;
// this identifies how the object was referenced, helps a ton with debugging
this.#expression = expression;
}
#buildExpression(propName, index = undefined) {
let expression = `${this.#expression}.${propName}`;
if (index !== undefined) {
expression += `[${index}]`;
}
return expression;
}
// used by "optional" functions to return undefined instead of throwing an error
#optional(func) {
try {
return func();
} catch (ex) {
return undefined;
}
}
// will not throw an error if property doesn't exist, defaults to empty array
children(propName) {
const nodes = this.#obj[propName] ?? [];
return nodes.map((value, index) => new Data(value, this.#buildExpression(propName, index)));
}
// throws an error if property (or index on property) doesn't exist
child(propName, index = 0) {
const nodes = this.#obj[propName];
if (nodes === undefined) {
throw new Error(`Could not find ${this.#buildExpression(propName)}.`);
}
const node = nodes[index];
if (node === undefined) {
throw new Error(`Could not find ${this.#buildExpression(propName, index)}.`);
}
return new Data(node, this.#buildExpression(propName, index));
}
// convenience function, since it's very common to want the value of a child
childValue(propName, index = 0) {
return this.child(propName, index).value();
}
// throws an error if this object doesn't have a value string
value() {
const value = this.#obj._;
if (value === undefined) {
throw new Error(`Could not get value from ${this.#expression}.`);
}
return value;
}
// throws an error if attribute does not exist
attribute(attrName) {
const attribute = this.#obj.$?.[attrName];
if (attribute === undefined) {
throw new Error(`Could not get attribute ${attrName} from ${this.#expression}.`);
}
return attribute;
}
optionalChild(propName, index = 0) {
return this.#optional(() => this.child(propName, index));
}
optionalChildValue(propName, index = 0) {
return this.#optional(() => this.childValue(propName, index));
}
optionalValue() {
return this.#optional(() => this.value());
}
}
export async function load(content) {
const rootData = await xml2js.parseStringPromise(content, {
tagNameProcessors: [xml2js.processors.stripPrefix],
trim: true
}).catch((ex) => {
ex.message = 'Could not parse XML. This likely means your import file is malformed.\n\n' + ex.message;
throw ex;
});
const rssData = rootData.rss;
if (rssData === undefined) {
throw new Error('Could not find <rss> root node. This likely means your import file is malformed.')
}
return new Data(rssData, 'rss');
}
+63
View File
@@ -0,0 +1,63 @@
export function author(post) {
// not decoded (WordPress doesn't allow funky characters in usernames anyway)
// surprisingly, does not always exist (squarespace exports, for example)
return post.data.optionalChildValue('creator');
}
export function categories(post) {
// array of decoded category names, excluding 'uncategorized'
const categories = post.data.children('category');
return categories
.filter((category) => category.attribute('domain') === 'category' && category.attribute('nicename') !== 'uncategorized')
.map((category) => decodeURIComponent(category.attribute('nicename')));
}
export function coverImage(post) {
// cover image filename, previously parsed and decoded
return post.coverImage;
}
export function date(post) {
// a luxon datetime object, previously parsed
return post.date;
}
export function draft(post) {
// boolean representing the previously parsed draft status, only included when true
return post.isDraft ? true : undefined;
}
export function excerpt(post) {
// not decoded, newlines collapsed
// does not always exist (squarespace exports, for example)
const encoded = post.data.optionalChildValue('encoded', 1);
return encoded ? encoded.replace(/[\r\n]+/gm, ' ') : undefined;
}
export function id(post) {
// previously parsed as a string, converted to integer here
return parseInt(post.id);
}
export function slug(post) {
// previously parsed and decoded
return post.slug;
}
export function tags(post) {
// array of decoded tag names (yes, they come from <category> nodes, not a typo)
const categories = post.data.children('category');
return categories
.filter((category) => category.attribute('domain') === 'post_tag')
.map((category) => decodeURIComponent(category.attribute('nicename')));
}
export function title(post) {
// not decoded
return post.data.childValue('title');
}
export function type(post) {
// previously parsed but not decoded, can be "post", "page", or other custom types
return post.type;
}
-5
View File
@@ -1,5 +0,0 @@
// get author, without decoding
// WordPress doesn't allow funky characters in usernames anyway
module.exports = (post) => {
return post.data.creator[0];
}
-14
View File
@@ -1,14 +0,0 @@
const settings = require('../settings');
// get array of decoded category names, filtered as specified in settings
module.exports = (post) => {
if (!post.data.category) {
return [];
}
const categories = post.data.category
.filter(category => category.$.domain === 'category')
.map(({ $: attributes }) => decodeURIComponent(attributes.nicename));
return categories.filter(category => !settings.filter_categories.includes(category));
};
-5
View File
@@ -1,5 +0,0 @@
// get cover image filename, previously decoded and set on post.meta
// this one is unique as it relies on special logic executed by the parser
module.exports = (post) => {
return post.meta.coverImage;
};
-17
View File
@@ -1,17 +0,0 @@
const luxon = require('luxon');
const settings = require('../settings');
// get post date, optionally formatted as specified in settings
// this value is also used for year/month folders, date prefixes, etc. as needed
module.exports = (post) => {
const dateTime = luxon.DateTime.fromRFC2822(post.data.pubDate[0], { zone: settings.custom_date_timezone });
if (settings.custom_date_formatting) {
return dateTime.toFormat(settings.custom_date_formatting);
} else if (settings.include_time_with_date) {
return dateTime.toISO();
} else {
return dateTime.toISODate();
}
};
-19
View File
@@ -1,19 +0,0 @@
/*
1. Copy this file, rename to the frontmatter field name you want, camelcased
2. Edit frontmatter_fields in settings.js to include your new field name
3. Run the script to see post data dumps, to see what you can work with
4. Write your code to get and return what you want
5. Update "get whatever" comment to describe what you're getting
6. Remove your field name from frontmatter_fields in settings.js
7. Remove this comment block and the debug console code
8. Make that pull request!
*/
// get whatever
module.exports = (post) => {
console.log('\nBEGIN POST DATA DUMP ===========================================================\n');
console.dir(post, { depth: null });
console.log('\nEND POST DATA DUMP =============================================================\n');
return 'EXAMPLE: ' + post.data.title[0];
};
-4
View File
@@ -1,4 +0,0 @@
// get excerpt, not decoded, newlines collapsed
module.exports = (post) => {
return post.data.encoded[1].replace(/[\r\n]+/gm, ' ');
};
-4
View File
@@ -1,4 +0,0 @@
// get ID
module.exports = (post) => {
return post.data.post_id[0];
}
-4
View File
@@ -1,4 +0,0 @@
// get slug, previously decoded and set on post.meta
module.exports = (post) => {
return post.meta.slug;
};
-12
View File
@@ -1,12 +0,0 @@
// get array of decoded tag names
module.exports = (post) => {
if (!post.data.category) {
return [];
}
const categories = post.data.category
.filter(category => category.$.domain === 'post_tag')
.map(({ $: attributes }) => decodeURIComponent(attributes.nicename));
return categories;
};
-4
View File
@@ -1,4 +0,0 @@
// get simple post title, but not decoded like other frontmatter string fields
module.exports = (post) => {
return post.data.title[0];
};
-5
View File
@@ -1,5 +0,0 @@
// get type, often this will always be "post"
// but can also be "page" or other custom types
module.exports = (post) => {
return post.data.post_type[0];
}
+165
View File
@@ -0,0 +1,165 @@
import chalk from 'chalk';
import * as commander from 'commander';
import * as luxon from 'luxon';
import path from 'path';
import * as normalizers from './normalizers.js';
import * as questions from './questions.js';
import * as shared from './shared.js';
// visual formatting for wizard
const promptTheme = {
prefix: {
idle: chalk.gray('\n?'),
done: chalk.green('✓')
},
style: {
description: (text) => chalk.gray('example: ' + text)
}
};
export async function getConfig() {
// check command line for any config options
const commandLineQuestions = questions.load();
const commandLineAnswers = getCommandLineAnswers(commandLineQuestions);
let wizardAnswers;
if (commandLineAnswers.wizard) {
shared.logHeading('Starting wizard');
// run wizard for questions with prompts that were not answered via the command line
const wizardQuestions = questions.load().filter((question) => {
return question.prompt && !(shared.camelCase(question.name) in commandLineAnswers);
});
wizardAnswers = await getWizardAnswers(wizardQuestions, commandLineAnswers);
} else {
shared.logHeading('Skipping wizard');
}
Object.assign(shared.config, commandLineAnswers, wizardAnswers);
}
function getCommandLineAnswers(questions) {
// show errors in red
commander.program.configureOutput({
outputError: (str, write) => write(chalk.red(str))
});
questions.forEach((question) => {
const option = new commander.Option('--' + question.name + ' <' + question.type + '>', question.description);
option.default(question.default);
if (!question.description) {
option.hideHelp();
}
if (question.choices && question.type !== 'boolean') {
// let commander handle non-boolean multiple choice validation
option.choices(question.choices.map((choice) => choice.value));
} else {
option.argParser((value) => normalize(value, question.type, (errorMessage) => {
throw new commander.InvalidArgumentError(errorMessage);
}));
}
commander.program.addOption(option);
});
const answers = commander.program.parse().opts();
// do some post-processing on the answers
for (const [key, value] of Object.entries(answers)) {
// the "wizard" answer and any user-provided (not defaulted) answers are left alone
if (key === 'wizard' || commander.program.getOptionValueSource(key) !== 'default') {
continue;
}
const question = questions.find((question) => shared.camelCase(question.name) === key);
if (answers.wizard && question.prompt) {
// remove this default answer, allowing the wizard to ask about it later
delete answers[key];
} else {
// normalize and validate default answer
answers[key] = normalize(value, question.type, (errorMessage) => {
// this is formatted to match how commander displays other errors
commander.program.error(`error: option '--${question.name} <${question.type}>' argument '${value}' is invalid. ${errorMessage}`);
});
}
}
return answers;
}
export async function getWizardAnswers(questions, commandLineAnswers) {
const answers = {};
for (const question of questions) {
let answerKey = shared.camelCase(question.name);
let normalizedAnswer; // holds normalized answer value potentially returned during validation
const promptConfig = {
theme: promptTheme,
message: question.description + '?',
default: question.default,
};
if (question.choices) {
promptConfig.choices = question.choices;
promptConfig.loop = false;
if (question.isPathQuestion) {
promptConfig.choices.forEach((choice) => {
// show example path if this choice is selected
choice.description = buildSamplePostPath({
...commandLineAnswers, // with command line answers
...answers, // and wizard answers so far
output: path.sep, // and a simplified output folder
[answerKey]: choice.value // and this choice selected
});
});
}
} else {
promptConfig.validate = (value) => {
let validationErrorMessage;
normalizedAnswer = normalize(value, question.type, (errorMessage) => {
validationErrorMessage = errorMessage;
});
return validationErrorMessage ?? true;
}
}
const answer = await question.prompt(promptConfig).catch((ex) => {
// exit gracefully if user hits ctrl + c during wizard
if (ex instanceof Error && ex.name === 'ExitPromptError') {
console.log('\nUser quit wizard early.');
process.exit(0);
} else {
throw ex;
}
});
answers[answerKey] = normalizedAnswer ?? answer;
}
return answers;
}
function normalize(value, type, onError) {
const normalizer = normalizers[shared.camelCase(type)];
if (!normalizer) {
return value;
}
try {
return normalizer(value);
} catch (ex) {
onError(ex.message);
}
}
export function buildSamplePostPath(overrideConfig) {
const samplePost = {
date: luxon.DateTime.now(),
slug: 'my-post'
};
return shared.buildPostPath(samplePost, overrideConfig);
}
+49
View File
@@ -0,0 +1,49 @@
import fs from 'fs';
import path from 'path';
export function boolean(value) {
if (typeof value === 'boolean') {
return value;
} else if (value === 'true') {
return true;
} else if (value === 'false') {
return false;
}
throw new Error('Must be true or false.');
}
export function filePath(value) {
const unwrapped = value.replace(/"(.*?)"/, '$1');
const absolute = path.resolve(unwrapped);
let fileExists;
try {
fileExists = fs.existsSync(absolute) && fs.statSync(absolute).isFile();
} catch (ex) {
fileExists = false;
}
if (fileExists) {
return absolute;
}
throw new Error('File not found at ' + absolute + '.');
}
export function list(value) {
if (Array.isArray(value)) {
return value;
} else {
return value.trim().split(/\s*,\s*/);
}
}
export function integer(value) {
const int = parseInt(value);
if (!Number.isNaN(int) && int >= 0) {
return int;
}
throw new Error('Must be an integer >= 0.');
}
+140 -112
View File
@@ -1,32 +1,26 @@
const fs = require('fs');
const requireDirectory = require('require-directory');
const xml2js = require('xml2js');
import chalk from 'chalk';
import fs from 'fs';
import * as luxon from 'luxon';
import * as data from './data.js';
import * as frontmatter from './frontmatter.js';
import * as shared from './shared.js';
import * as translator from './translator.js';
const shared = require('./shared');
const settings = require('./settings');
const translator = require('./translator');
export async function parseFilePromise() {
shared.logHeading('Parsing');
const content = await fs.promises.readFile(shared.config.input, 'utf8');
const rssData = await data.load(content);
const allPostData = rssData.child('channel').children('item');
// dynamically requires all frontmatter getters
const frontmatterGetters = requireDirectory(module, './frontmatter', { recurse: false });
async function parseFilePromise(config) {
console.log('\nParsing...');
const content = await fs.promises.readFile(config.input, 'utf8');
const allData = await xml2js.parseStringPromise(content, {
trim: true,
tagNameProcessors: [xml2js.processors.stripPrefix]
});
const channelData = allData.rss.channel[0].item;
const postTypes = getPostTypes(channelData, config);
const posts = collectPosts(channelData, postTypes, config);
const postTypes = getPostTypes(allPostData);
const posts = collectPosts(allPostData, postTypes);
const images = [];
if (config.saveAttachedImages) {
images.push(...collectAttachedImages(channelData));
if (shared.config.saveImages === 'attached' || shared.config.saveImages === 'all') {
images.push(...collectAttachedImages(allPostData));
}
if (config.saveScrapedImages) {
images.push(...collectScrapedImages(channelData, postTypes));
if (shared.config.saveImages === 'scraped' || shared.config.saveImages === 'all') {
images.push(...collectScrapedImages(allPostData, postTypes));
}
mergeImagesIntoPosts(images, posts);
@@ -35,110 +29,135 @@ async function parseFilePromise(config) {
return posts;
}
function getPostTypes(channelData, config) {
if (config.includeOtherTypes) {
// search export file for all post types minus some default types we don't want
// effectively this will be 'post', 'page', and custom post types
const types = channelData
.map(item => item.post_type[0])
.filter(type => !['attachment', 'revision', 'nav_menu_item', 'custom_css', 'customize_changeset'].includes(type));
return [...new Set(types)]; // remove duplicates
} else {
// just plain old vanilla "post" posts
return ['post'];
}
function getPostTypes(allPostData) {
// search export file for all post types minus some specific types we don't want
const postTypes = [...new Set(allPostData // new Set() is used to dedupe array
.map((postData) => postData.childValue('post_type'))
.filter((postType) => ![
'attachment',
'revision',
'nav_menu_item',
'custom_css',
'customize_changeset',
'oembed_cache',
'user_request',
'wp_block',
'wp_global_styles',
'wp_navigation',
'wp_template',
'wp_template_part'
].includes(postType))
)];
// change order to "post", "page", then all custom post types (alphabetically)
prioritizePostType(postTypes, 'page');
prioritizePostType(postTypes, 'post');
return postTypes;
}
function getItemsOfType(channelData, type) {
return channelData.filter(item => item.post_type[0] === type);
function getItemsOfType(allPostData, type) {
return allPostData.filter((item) => item.childValue('post_type') === type);
}
function collectPosts(channelData, postTypes, config) {
// this is passed into getPostContent() for the markdown conversion
const turndownService = translator.initTurndownService();
function collectPosts(allPostData, postTypes) {
let allPosts = [];
postTypes.forEach(postType => {
const postsForType = getItemsOfType(channelData, postType)
.filter(postData => postData.status[0] !== 'trash' && postData.status[0] !== 'draft')
.map(postData => ({
// raw post data, used by frontmatter getters
data: postData,
postTypes.forEach((postType) => {
const postsForType = getItemsOfType(allPostData, postType)
.filter((postData) => postData.childValue('status') !== 'trash')
.filter((postData) => !(postType === 'page' && postData.childValue('post_name') === 'sample-page'))
.map((postData) => buildPost(postData));
// meta data isn't written to file, but is used to help with other things
meta: {
id: getPostId(postData),
slug: getPostSlug(postData),
coverImageId: getPostCoverImageId(postData),
coverImage: undefined, // possibly set later in mergeImagesIntoPosts()
type: postType,
imageUrls: [] // possibly set later in mergeImagesIntoPosts()
},
// contents of the post in markdown
content: translator.getPostContent(postData, turndownService, config)
}));
if (postTypes.length > 1) {
console.log(`${postsForType.length} "${postType}" posts found.`);
if (postsForType.length > 0) {
if (postType === 'post') {
console.log(`${postsForType.length} normal posts found.`);
} else if (postType === 'page') {
console.log(`${postsForType.length} pages found.`);
} else {
console.log(`${postsForType.length} custom "${postType}" posts found.`);
}
}
allPosts.push(...postsForType);
});
if (postTypes.length === 1) {
console.log(allPosts.length + ' posts found.');
}
return allPosts;
}
function getPostId(postData) {
return postData.post_id[0];
function buildPost(data) {
return {
// full raw post data
data,
// body content converted to markdown
content: translator.getPostContent(data.childValue('encoded')),
// particularly useful values for all sorts of things
type: data.childValue('post_type'),
id: data.childValue('post_id'),
isDraft: data.childValue('status') === 'draft',
slug: decodeURIComponent(data.childValue('post_name')),
date: getPostDate(data),
coverImageId: getPostMetaValue(data, '_thumbnail_id'),
// these are possibly set later in mergeImagesIntoPosts()
coverImage: undefined,
imageUrls: []
};
}
function getPostSlug(postData) {
return decodeURIComponent(postData.post_name[0]);
function getPostDate(data) {
const date = luxon.DateTime.fromRFC2822(data.childValue('pubDate'), { zone: shared.config.timezone });
return date.isValid ? date : undefined;
}
function getPostCoverImageId(postData) {
if (postData.postmeta === undefined) {
return undefined;
function getPostMetaValue(data, key) {
const metas = data.children('postmeta');
const meta = metas.find((meta) => meta.childValue('meta_key') === key);
return meta ? meta.childValue('meta_value') : undefined;
}
const postmeta = postData.postmeta.find(postmeta => postmeta.meta_key[0] === '_thumbnail_id');
const id = postmeta ? postmeta.meta_value[0] : undefined;
return id;
}
function collectAttachedImages(channelData) {
const images = getItemsOfType(channelData, 'attachment')
function collectAttachedImages(allPostData) {
const images = getItemsOfType(allPostData, 'attachment')
// filter to certain image file types
.filter(attachment => attachment.attachment_url && (/\.(gif|jpe?g|png|webp)$/i).test(attachment.attachment_url[0]))
.map(attachment => ({
id: attachment.post_id[0],
postId: attachment.post_parent[0],
url: attachment.attachment_url[0]
.filter((attachment) => {
const url = attachment.childValue('attachment_url');
return url && (/\.(gif|jpe?g|png|webp)$/i).test(url);
})
.map((attachment) => ({
id: attachment.childValue('post_id'),
postId: attachment.optionalChildValue('post_parent') ?? 'nope', // may not exist (cover image in a squarespace export, for example)
url: attachment.childValue('attachment_url')
}));
console.log(images.length + ' attached images found.');
return images;
}
function collectScrapedImages(channelData, postTypes) {
function collectScrapedImages(allPostData, postTypes) {
const images = [];
postTypes.forEach(postType => {
getItemsOfType(channelData, postType).forEach(postData => {
const postId = postData.post_id[0];
const postContent = postData.encoded[0];
const postLink = postData.link[0];
postTypes.forEach((postType) => {
getItemsOfType(allPostData, postType).forEach((postData) => {
const postId = postData.childValue('post_id');
const postContent = postData.childValue('encoded');
const scrapedUrls = [...postContent.matchAll(/<img(?=\s)[^>]+?(?<=\s)src="(.+?)"[^>]*>/gi)].map((match) => match[1]);
scrapedUrls.forEach((scrapedUrl) => {
let url;
if (isAbsoluteUrl(scrapedUrl)) {
url = scrapedUrl;
} else {
const postLink = postData.childValue('link');
if (isAbsoluteUrl(postLink)) {
url = new URL(scrapedUrl, postLink).href;
} else {
throw new Error(`Unable to determine absolute URL from scraped image URL '${scrapedUrl}' and post link URL '${postLink}'.`);
}
}
const matches = [...postContent.matchAll(/<img[^>]*src="(.+?\.(?:gif|jpe?g|png|webp))"[^>]*>/gi)];
matches.forEach(match => {
// base the matched image URL relative to the post URL
const url = new URL(match[1], postLink).href;
images.push({
id: -1,
postId: postId,
id: 'nope', // scraped images don't have an id
postId,
url
});
});
@@ -150,43 +169,52 @@ function collectScrapedImages(channelData, postTypes) {
}
function mergeImagesIntoPosts(images, posts) {
images.forEach(image => {
posts.forEach(post => {
images.forEach((image) => {
posts.forEach((post) => {
let shouldAttach = false;
// this image was uploaded as an attachment to this post
if (image.postId === post.meta.id) {
if (image.postId === post.id) {
shouldAttach = true;
}
// this image was set as the featured image for this post
if (image.id === post.meta.coverImageId) {
if (image.id === post.coverImageId) {
shouldAttach = true;
post.meta.coverImage = shared.getFilenameFromUrl(image.url);
post.coverImage = shared.getFilenameFromUrl(image.url);
}
if (shouldAttach && !post.meta.imageUrls.includes(image.url)) {
post.meta.imageUrls.push(image.url);
if (shouldAttach && !post.imageUrls.includes(image.url)) {
post.imageUrls.push(image.url);
}
});
});
}
function populateFrontmatter(posts) {
posts.forEach(post => {
const frontmatter = {};
settings.frontmatter_fields.forEach(field => {
posts.forEach((post) => {
post.frontmatter = {};
shared.config.frontmatterFields.forEach((field) => {
const [key, alias] = field.split(':');
let frontmatterGetter = frontmatterGetters[key];
let frontmatterGetter = frontmatter[key];
if (!frontmatterGetter) {
throw `Could not find a frontmatter getter named "${key}".`;
}
frontmatter[alias || key] = frontmatterGetter(post);
post.frontmatter[alias ?? key] = frontmatterGetter(post);
});
post.frontmatter = frontmatter;
});
}
exports.parseFilePromise = parseFilePromise;
function prioritizePostType(postTypes, postType) {
const index = postTypes.indexOf(postType);
if (index !== -1) {
postTypes.splice(index, 1);
postTypes.unshift(postType);
}
}
function isAbsoluteUrl(url) {
return (/^https?:\/\//i).test(url);
}
+158
View File
@@ -0,0 +1,158 @@
import * as inquirer from '@inquirer/prompts';
export function load() {
// questions with a description are displayed in command line help
// questions with a prompt are included in the wizard (if not set on the command line)
return [
{
name: 'input',
type: 'file-path',
description: 'Path to WordPress export file',
default: 'export.xml',
prompt: inquirer.input
},
{
name: 'post-folders',
type: 'boolean',
description: 'Put each post into its own folder',
default: true,
choices: [
{
name: 'Yes',
value: true
},
{
name: 'No',
value: false
}
],
isPathQuestion: true,
prompt: inquirer.select
},
{
name: 'prefix-date',
type: 'boolean',
description: 'Add date prefix to posts',
default: false,
choices: [
{
name: 'Yes',
value: true
},
{
name: 'No',
value: false
}
],
isPathQuestion: true,
prompt: inquirer.select
},
{
name: 'date-folders',
type: 'choice',
description: 'Organize posts into date folders',
default: 'none',
choices: [
{
name: 'Year folders',
value: 'year'
},
{
name: 'Year and month folders',
value: 'year-month'
},
{
name: 'No',
value: 'none'
}
],
isPathQuestion: true,
prompt: inquirer.select
},
{
name: 'save-images',
type: 'choice',
description: 'Save images',
default: 'all',
choices: [
{
name: 'Images attached to posts',
value: 'attached'
},
{
name: 'Images scraped from post body content',
value: 'scraped'
},
{
name: 'All Images',
value: 'all'
},
{
name: 'No',
value: 'none'
}
],
prompt: inquirer.select
},
{
name: 'wizard',
type: 'boolean',
description: 'Use wizard',
default: true
},
{
name: 'output',
type: 'folder-path',
description: 'Path to output folder',
default: 'output'
},
{
name: 'frontmatter-fields',
type: 'list',
description: 'Frontmatter fields',
default: 'title,date,categories,tags,coverImage,draft'
},
{
name: 'request-delay',
type: 'integer',
description: 'Delay between image file requests',
default: 500
},
{
name: 'write-delay',
type: 'integer',
description: 'Delay between writing markdown files',
default: 25
},
{
name: 'timezone',
type: 'string',
description: 'Timezone to apply to date',
default: 'utc'
},
{
name: 'include-time',
type: 'boolean',
description: 'Include time with frontmatter date',
default: false
},
{
name: 'date-format',
type: 'string',
description: 'Frontmatter date format string',
default: ''
},
{
name: 'quote-date',
type: 'boolean',
description: 'Wrap frontmatter date in quotes',
default: false
},
{
name: 'strict-ssl',
type: 'boolean',
description: 'Use strict SSL',
default: true
}
];
}
-40
View File
@@ -1,40 +0,0 @@
// Which fields to include in frontmatter. Look in /src/frontmatter to see available fields.
// Order is preserved. If a field has an empty value, it will not be included. You can rename a
// field by providing an alias after a ':'. For example, 'date:created' will include 'date' in
// frontmatter, but renamed to 'created'.
exports.frontmatter_fields = [
'title',
'date',
'categories',
'tags',
'coverImage'
];
// Time in ms to wait between requesting image files. Increase this if you see timeouts or
// server errors.
exports.image_file_request_delay = 500;
// Time in ms to wait between saving Markdown files. Increase this if your file system becomes
// overloaded.
exports.markdown_file_write_delay = 25;
// Enable this to include time with post dates. For example, "2020-12-25" would become
// "2020-12-25T11:20:35.000Z".
exports.include_time_with_date = false;
// Override post date formatting with a custom formatting string (for example: 'yyyy LLL dd').
// Tokens are documented here: https://moment.github.io/luxon/#/parsing?id=table-of-tokens. If
// set, this takes precedence over include_time_with_date.
exports.custom_date_formatting = '';
// Specify the timezone used for post dates. See available zone values and examples here:
// https://moment.github.io/luxon/#/zones?id=specifying-a-zone.
exports.custom_date_timezone = 'utc';
// Categories to be excluded from post frontmatter. This does not filter out posts themselves,
// just the categories listed in their frontmatter.
exports.filter_categories = ['uncategorized'];
// Strict SSL is enabled as the safe default when downloading images, but will not work with
// self-signed servers. You can disable it if you're getting a "self-signed certificate" error.
exports.strict_ssl = true;
+74 -3
View File
@@ -1,4 +1,77 @@
function getFilenameFromUrl(url) {
import chalk from 'chalk';
import path from 'path';
// simple data store, populated via intake, used everywhere
export const config = {};
export function camelCase(str) {
return str.replace(/-(.)/g, (match) => match[1].toUpperCase());
}
export function getSlugWithFallback(post) {
return post.slug ? post.slug : 'id-' + post.id;
}
export function logHeading(text) {
console.log(`\n${chalk.cyan(text + '...')}`);
}
export function buildPostPath(post, overrideConfig) {
const pathConfig = overrideConfig ?? config;
// start with output folder
const pathSegments = [pathConfig.output];
// add folder for post type if exists
if (post.type) {
switch (post.type) {
case 'post':
pathSegments.push('posts');
break;
case 'page':
pathSegments.push('pages');
break;
default:
pathSegments.push('custom');
pathSegments.push(post.type);
}
}
// add drafts folder if this is a draft post
if (post.isDraft) {
pathSegments.push('_drafts');
}
// add folders for date year/month as appropriate
if (post.date) {
if (pathConfig.dateFolders === 'year' || pathConfig.dateFolders === 'year-month') {
pathSegments.push(post.date.toFormat('yyyy'));
}
if (pathConfig.dateFolders === 'year-month') {
pathSegments.push(post.date.toFormat('LL'));
}
}
// get slug with fallback
let slug = getSlugWithFallback(post);
// prepend date to slug as appropriate
if (pathConfig.prefixDate && post.date) {
slug = post.date.toFormat('yyyy-LL-dd') + '-' + slug;
}
// use slug as folder or filename as specified
if (pathConfig.postFolders) {
pathSegments.push(slug, 'index.md');
} else {
pathSegments.push(slug + '.md');
}
return path.join(...pathSegments);
}
export function getFilenameFromUrl(url) {
let filename = url.split('/').slice(-1)[0];
try {
filename = decodeURIComponent(filename)
@@ -8,5 +81,3 @@ function getFilenameFromUrl(url) {
}
return filename;
}
exports.getFilenameFromUrl = getFilenameFromUrl;
+27 -15
View File
@@ -1,5 +1,9 @@
const turndown = require('turndown');
const turndownPluginGfm = require('turndown-plugin-gfm');
import turndownPluginGfm from '@guyplusplus/turndown-plugin-gfm';
import turndown from 'turndown';
import * as shared from './shared.js';
// init single reusable turndown service object upon import
const turndownService = initTurndownService();
function initTurndownService() {
const turndownService = new turndown({
@@ -10,15 +14,17 @@ function initTurndownService() {
turndownService.use(turndownPluginGfm.tables);
turndownService.remove(['style']); // <style> contents get dumped as plain text, would rather remove
// preserve embedded tweets
turndownService.addRule('tweet', {
filter: node => node.nodeName === 'BLOCKQUOTE' && node.getAttribute('class') === 'twitter-tweet',
filter: (node) => node.nodeName === 'BLOCKQUOTE' && node.getAttribute('class') === 'twitter-tweet',
replacement: (content, node) => '\n\n' + node.outerHTML
});
// preserve embedded codepens
turndownService.addRule('codepen', {
filter: node => {
filter: (node) => {
// codepen embed snippets have changed over the years
// but this series of checks should find the commonalities
return (
@@ -30,6 +36,14 @@ function initTurndownService() {
replacement: (content, node) => '\n\n' + node.outerHTML
});
// <div> within <a> can cause extra whitespace that wreck markdown links, so this removes them
turndownService.addRule('a', {
filter: 'a',
replacement: (content) => {
return content.replace(/<\/?div[^>]*>/gi, '');
}
});
// preserve embedded scripts (for tweets, codepens, gists, etc.)
turndownService.addRule('script', {
filter: 'script',
@@ -73,7 +87,7 @@ function initTurndownService() {
// preserve <figcaption>
turndownService.addRule('figcaption', {
filter: 'figcaption',
replacement: (content, node) => {
replacement: (content) => {
// extra newlines are necessary for markdown and HTML to render correctly together
return '\n\n<figcaption>\n\n' + content + '\n\n</figcaption>\n\n';
}
@@ -81,12 +95,12 @@ function initTurndownService() {
// convert <pre> into a code block with language when appropriate
turndownService.addRule('pre', {
filter: node => {
filter: (node) => {
// a <pre> with <code> inside will already render nicely, so don't interfere
return node.nodeName === 'PRE' && !node.querySelector('code');
},
replacement: (content, node) => {
const language = node.getAttribute('data-wetm-language') || '';
const language = node.getAttribute('data-wetm-language') ?? '';
return '\n\n```' + language + '\n' + node.textContent + '\n```\n\n';
}
});
@@ -94,18 +108,16 @@ function initTurndownService() {
return turndownService;
}
function getPostContent(postData, turndownService, config) {
let content = postData.encoded[0];
export function getPostContent(content) {
// insert an empty div element between double line breaks
// this nifty trick causes turndown to keep adjacent paragraphs separated
// without mucking up content inside of other elements (like <code> blocks)
content = content.replace(/(\r?\n){2}/g, '\n<div></div>\n');
if (config.saveScrapedImages) {
if (shared.config.saveImages === 'scraped' || shared.config.saveImages === 'all') {
// writeImageFile() will save all content images to a relative /images
// folder so update references in post content to match
content = content.replace(/(<img[^>]*src=").*?([^/"]+\.(?:gif|jpe?g|png|webp))("[^>]*>)/gi, '$1images/$2$3');
content = content.replace(/(<img(?=\s)[^>]+?(?<=\s)src=")[^"]*?([^/"]+)("[^>]*>)/gi, '$1images/$2$3');
}
// preserve "more" separator, max one per post, optionally with custom label
@@ -122,8 +134,8 @@ function getPostContent(postData, turndownService, config) {
// clean up extra spaces in list items
content = content.replace(/(-|\d+\.) +/g, '$1 ');
// collapse excessive newlines (can happen with a lot of <div>)
content = content.replace(/(\r?\n){3,}/g, '\n\n');
return content;
}
exports.initTurndownService = initTurndownService;
exports.getPostContent = getPostContent;
-201
View File
@@ -1,201 +0,0 @@
const camelcase = require('camelcase');
const commander = require('commander');
const fs = require('fs');
const inquirer = require('inquirer');
const path = require('path');
const package = require('../package.json');
// all user options for command line and wizard are declared here
const options = [
// wizard must always be first
{
name: 'wizard',
type: 'boolean',
description: 'Use wizard',
default: true
},
{
name: 'input',
type: 'file',
description: 'Path to WordPress export file',
default: 'export.xml'
},
{
name: 'output',
type: 'folder',
description: 'Path to output folder',
default: 'output'
},
{
name: 'year-folders',
aliases: ['yearfolders', 'yearmonthfolders'],
type: 'boolean',
description: 'Create year folders',
default: false
},
{
name: 'month-folders',
aliases: ['yearmonthfolders'],
type: 'boolean',
description: 'Create month folders',
default: false
},
{
name: 'post-folders',
aliases: ['postfolders'],
type: 'boolean',
description: 'Create a folder for each post',
default: true
},
{
name: 'prefix-date',
aliases: ['prefixdate'],
type: 'boolean',
description: 'Prefix post folders/files with date',
default: false
},
{
name: 'save-attached-images',
aliases: ['saveimages'],
type: 'boolean',
description: 'Save images attached to posts',
default: true
},
{
name: 'save-scraped-images',
aliases: ['addcontentimages'],
type: 'boolean',
description: 'Save images scraped from post body content',
default: true
},
{
name: 'include-other-types',
type: 'boolean',
description: 'Include custom post types and pages',
default: false
}
];
async function getConfig(argv) {
extendOptionsData();
const unaliasedArgv = replaceAliases(argv);
const opts = parseCommandLine(unaliasedArgv);
let answers;
if (opts.wizard) {
console.log('\nStarting wizard...');
const questions = options.map(option => ({
when: option.name !== 'wizard' && !option.isProvided,
name: camelcase(option.name),
type: option.prompt,
message: option.description + '?',
default: option.default,
// these are not used for all option types and that's fine
filter: option.coerce,
validate: option.validate
}));
answers = await inquirer.prompt(questions);
} else {
console.log('\nSkipping wizard...');
answers = {};
}
const config = { ...opts, ...answers };
return config;
}
function extendOptionsData() {
// add more data to each option based on its type
const map = {
boolean: {
prompt: 'confirm',
coerce: coerceBoolean,
},
file: {
prompt: 'input',
coerce: coercePath,
validate: validateFile
},
folder: {
prompt: 'input',
coerce: coercePath
}
};
options.forEach(option => {
Object.assign(option, map[option.type]);
});
}
function replaceAliases(argv) {
let paths = argv.slice(0, 2);
let replaced = [];
let unmodified = [];
argv.slice(2).forEach(arg => {
let aliasFound = false;
// this loop does not short circuit because an alias can map to multiple options
options.forEach(option => {
const aliases = option.aliases || [];
aliases.forEach(alias => {
if (arg.includes('--' + alias)) {
replaced.push(arg.replace('--' + alias, '--' + option.name));
aliasFound = true;
}
});
});
if (!aliasFound) {
unmodified.push(arg);
}
});
return [...paths, ...replaced, ...unmodified];
}
function parseCommandLine(argv) {
// setup for help output
commander.program
.name('node index.js')
.version('v' + package.version, '-v, --version', 'Display version number')
.helpOption('-h, --help', 'See the thing you\'re looking at right now')
.addHelpText('after', '\nMore documentation is at https://github.com/lonekorean/wordpress-export-to-markdown');
options.forEach(input => {
const flag = '--' + input.name + ' <' + input.type + '>';
const coerce = (value) => {
// commander only calls coerce when an input is provided on the command line, which
// makes for an easy way to flag (for later) if it should be excluded from the wizard
input.isProvided = true;
return input.coerce(value);
};
commander.program.option(flag, input.description, coerce, input.default);
});
commander.program.parse(argv);
return commander.program.opts();
}
function coerceBoolean(value) {
return !['false', 'no', '0'].includes(value.toLowerCase());
}
function coercePath(value) {
return path.normalize(value);
}
function validateFile(value) {
let isValid;
try {
isValid = fs.existsSync(value) && fs.statSync(value).isFile();
} catch (ex) {
isValid = false;
}
return isValid ? true : 'Unable to find file: ' + path.resolve(value);
}
exports.getConfig = getConfig;
+84 -100
View File
@@ -1,36 +1,34 @@
const axios = require('axios');
const chalk = require('chalk');
const fs = require('fs');
const http = require('http');
const https = require('https');
const luxon = require('luxon');
const path = require('path');
import axios from 'axios';
import chalk from 'chalk';
import fs from 'fs';
import http from 'http';
import https from 'https';
import * as luxon from 'luxon';
import path from 'path';
import * as shared from './shared.js';
const shared = require('./shared');
const settings = require('./settings');
async function writeFilesPromise(posts, config) {
await writeMarkdownFilesPromise(posts, config);
await writeImageFilesPromise(posts, config);
export async function writeFilesPromise(posts) {
await writeMarkdownFilesPromise(posts);
await writeImageFilesPromise(posts);
}
async function processPayloadsPromise(payloads, loadFunc) {
const promises = payloads.map(payload => new Promise((resolve, reject) => {
const promises = payloads.map((payload) => new Promise((resolve, reject) => {
setTimeout(async () => {
try {
const data = await loadFunc(payload.item);
await writeFile(payload.destinationPath, data);
console.log(chalk.green('[OK]') + ' ' + payload.name);
logPayloadResult(payload);
resolve();
} catch (ex) {
console.log(chalk.red('[FAILED]') + ' ' + payload.name + ' ' + chalk.red('(' + ex.toString() + ')'));
logPayloadResult(payload, ex.message);
reject();
}
}, payload.delay);
}));
const results = await Promise.allSettled(promises);
const failedCount = results.filter(result => result.status === 'rejected').length;
const failedCount = results.filter((result) => result.status === 'rejected').length;
if (failedCount === 0) {
console.log('Done, got them all!');
} else {
@@ -43,33 +41,31 @@ async function writeFile(destinationPath, data) {
await fs.promises.writeFile(destinationPath, data);
}
async function writeMarkdownFilesPromise(posts, config) {
async function writeMarkdownFilesPromise(posts) {
// package up posts into payloads
let skipCount = 0;
let existingCount = 0;
let delay = 0;
const payloads = posts.flatMap(post => {
const destinationPath = getPostPath(post, config);
const payloads = posts.flatMap((post) => {
const destinationPath = shared.buildPostPath(post);
if (checkFile(destinationPath)) {
// already exists, don't need to save again
skipCount++;
existingCount++;
return [];
} else {
const payload = {
item: post,
name: (config.includeOtherTypes ? post.meta.type + ' - ' : '') + post.meta.slug,
type: post.type,
name: shared.getSlugWithFallback(post),
destinationPath,
delay
};
delay += settings.markdown_file_write_delay;
delay += shared.config.writeDelay;
return [payload];
}
});
const remainingCount = payloads.length;
if (remainingCount + skipCount === 0) {
console.log('\nNo posts to save...');
} else {
console.log(`\nSaving ${remainingCount} posts (${skipCount} already exist)...`);
logSavingMessage('posts', existingCount, payloads.length);
if (payloads.length > 0) {
await processPayloadsPromise(payloads, loadMarkdownFilePromise);
}
}
@@ -84,9 +80,25 @@ async function loadMarkdownFilePromise(post) {
// array of one or more strings
outputValue = value.reduce((list, item) => `${list}\n - "${item}"`, '');
}
} else if (Number.isInteger(value)) {
// output unquoted
outputValue = value.toString();
} else if (value instanceof luxon.DateTime) {
if (shared.config.dateFormat) {
outputValue = value.toFormat(shared.config.dateFormat);
} else {
outputValue = shared.config.includeTime ? value.toISO() : value.toISODate();
}
if (shared.config.quoteDate) {
outputValue = `"${outputValue}"`;
}
} else if (typeof value === 'boolean') {
// output unquoted
outputValue = value.toString();
} else {
// single string value
const escapedValue = (value || '').replace(/"/g, '\\"');
const escapedValue = (value ?? '').replace(/"/g, '\\"');
if (escapedValue.length > 0) {
outputValue = `"${escapedValue}"`;
}
@@ -101,38 +113,36 @@ async function loadMarkdownFilePromise(post) {
return output;
}
async function writeImageFilesPromise(posts, config) {
async function writeImageFilesPromise(posts) {
// collect image data from all posts into a single flattened array of payloads
let skipCount = 0;
let existingCount = 0;
let delay = 0;
const payloads = posts.flatMap(post => {
const postPath = getPostPath(post, config);
const payloads = posts.flatMap((post) => {
const postPath = shared.buildPostPath(post);
const imagesDir = path.join(path.dirname(postPath), 'images');
return post.meta.imageUrls.flatMap(imageUrl => {
return post.imageUrls.flatMap((imageUrl) => {
const filename = shared.getFilenameFromUrl(imageUrl);
const destinationPath = path.join(imagesDir, filename);
if (checkFile(destinationPath)) {
// already exists, don't need to save again
skipCount++;
existingCount++;
return [];
} else {
const payload = {
item: imageUrl,
type: 'image',
name: filename,
destinationPath,
delay
};
delay += settings.image_file_request_delay;
delay += shared.config.requestDelay;
return [payload];
}
});
});
const remainingCount = payloads.length;
if (remainingCount + skipCount === 0) {
console.log('\nNo images to download and save...');
} else {
console.log(`\nDownloading and saving ${remainingCount} images (${skipCount} already exist)...`);
logSavingMessage('images', existingCount, payloads.length);
if (payloads.length > 0) {
await processPayloadsPromise(payloads, loadImageFilePromise);
}
}
@@ -141,7 +151,7 @@ async function loadImageFilePromise(imageUrl) {
// only encode the URL if it doesn't already have encoded characters
const url = (/%[\da-f]{2}/i).test(imageUrl) ? imageUrl : encodeURI(imageUrl);
const config = {
const requestConfig = {
method: 'get',
url,
headers: {
@@ -150,70 +160,44 @@ async function loadImageFilePromise(imageUrl) {
responseType: 'arraybuffer'
};
if (!settings.strict_ssl) {
if (!shared.config.strictSsl) {
// custom agents to disable SSL errors (adding both http and https, just in case)
config.httpAgent = new http.Agent({ rejectUnauthorized: false });
config.httpsAgent = new https.Agent({ rejectUnauthorized: false });
requestConfig.httpAgent = new http.Agent({ rejectUnauthorized: false });
requestConfig.httpsAgent = new https.Agent({ rejectUnauthorized: false });
}
let buffer;
try {
const response = await axios(config);
buffer = Buffer.from(response.data, 'binary');
} catch (ex) {
if (ex.response) {
// request was made, but server responded with an error status code
throw 'StatusCodeError: ' + ex.response.status;
} else {
// something else went wrong, rethrow
throw ex;
}
}
const response = await axios(requestConfig);
const buffer = Buffer.from(response.data, 'binary');
return buffer;
}
function getPostPath(post, config) {
let dt;
if (settings.custom_date_formatting) {
dt = luxon.DateTime.fromFormat(post.frontmatter.date, settings.custom_date_formatting);
} else {
dt = luxon.DateTime.fromISO(post.frontmatter.date);
}
// start with base output dir
const pathSegments = [config.output];
// create segment for post type if we're dealing with more than just "post"
if (config.includeOtherTypes) {
pathSegments.push(post.meta.type);
}
if (config.yearFolders) {
pathSegments.push(dt.toFormat('yyyy'));
}
if (config.monthFolders) {
pathSegments.push(dt.toFormat('LL'));
}
// create slug fragment, possibly date prefixed
let slugFragment = post.meta.slug;
if (config.prefixDate) {
slugFragment = dt.toFormat('yyyy-LL-dd') + '-' + slugFragment;
}
// use slug fragment as folder or filename as specified
if (config.postFolders) {
pathSegments.push(slugFragment, 'index.md');
} else {
pathSegments.push(slugFragment + '.md');
}
return path.join(...pathSegments);
}
function checkFile(path) {
return fs.existsSync(path);
}
exports.writeFilesPromise = writeFilesPromise;
function logSavingMessage(things, existingCount, remainingCount) {
shared.logHeading(`Saving ${things}`);
if (existingCount + remainingCount === 0) {
console.log(`No ${things} to save.`);
} else if (existingCount === 0) {
console.log(`${remainingCount} ${things} to save.`);
} else if (remainingCount === 0) {
console.log(`All ${existingCount} ${things} already saved.`);
} else {
console.log(`${existingCount} ${things} already saved, ${remainingCount} remaining.`);
}
}
function logPayloadResult(payload, errorMessage) {
const messageBits = [
errorMessage ? chalk.red('✗') : chalk.green('✓'),
chalk.gray(`[${payload.type}]`),
payload.name
];
if (errorMessage) {
messageBits.push(chalk.red(`(${errorMessage})`));
}
console.log(messageBits.join(' '));
}