Well, not this one. But this one is! How? Let’s take a closer look at Bluesky
and the AT Protocol that underpins it.
Note: I communicated with the Bluesky team prior to the publishing of this
post. While the functionality described is not the intended use of the
application, it is known behavior and does not constitute a vulnerability
disclosure process. My main motivation for reaching out to them was because I
like the folks and don’t want to make their lives harder.
Being able to host a website on Bluesky really has very little to do with
Bluesky itself. I happen to use Bluesky for hosting my Personal Data Server
(PDS), but all of the APIs leveraged in uploading the site contents are defined
at the AT Protocol level and implemented by a PDS. Bluesky offers access to my
PDS via their PDS entryway, which allows for the many (have you heard that they
are growing by a million users per day?) PDS instances they run to be exposed
via the bsky.social domain. That being said, individual PDS instances can be
accessed directly, and if you clicked the link at the top of this post to access
the Bluesky hosted website, then you have already visited mine at
porcini.us-east.host.bsky.network.
Most social applications, and many applications in general for that matter,
broadly have two primary types of content: records and blobs. Records are the core entity
types that users create. They generally have some defined structure and
metadata, and they may reference other records or content. Blobs are typically
larger unstructured data, such as media assets, that may be uploaded by a user,
but are exposed via a record referencing them. For example, on Bluesky a user
may upload an image, then create a post that references it. From an end-user
perspective, these two operations appear to be one action, but they are
typically decoupled at the API level.
This decoupling is described in detail in the AT Protocol blob
specification.
Blob files are uploaded and distributed separately from records. Blobs are
authoritatively stored by the account’s PDS instance, but views are commonly
served by CDNs associated with individual applications (“AppViews”), to reduce
traffic on the PDS. CDNs may serve transformed (resized, transcoded, etc)
versions of the original blob.
Later on, the specification details how the blob lifecycle is to be managed.
Blobs must be uploaded to the PDS before a record can be created referencing
that blob. Note that the server does not know the intended Lexicon when
receiving an upload, so can only apply generic blob limits and restrictions at
initial upload time, and then enforce Lexicon-defined limits later when the
record is created.
Reading this section is what initially got my wheels turning. While Bluesky has
a limited set of media asset types that can be referenced by posts, posts are
just one record type that is defined by the Bluesky lexicon (app.bsky.*).
Records, on the other hand, are defined in the AT Protocol lexicon
(com.atproto.*) and are
designed to accommodate creating any type of record defined by any lexicon.
Because different types of blobs may be relevant for other lexicons, the
specification highlights that restrictions cannot be enforced at time of upload.
Instead, blobs are not made available until they are referenced, at which point
the validation can be performed based on the lexicon of the record type.
After a successful upload, blobs are placed in temporary storage. They are not
accessible for download or distribution while in this state. Servers should
“garbage collect” (delete) un-referenced temporary blobs after an appropriate
time span (see implementation guidelines below). Blobs which are in temporary
storage should not be included in the listBlobs output.
The upload blob can now be referenced from records by including the returned
blob metadata in a record. When processing record creation, the server
extracts the set of all referenced blobs, and checks that they are either
already referenced, or are in temporary storage. Once the record creation
succeeds, the server makes the blob publicly accessible.
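The lifecycle in the quoted passages can be modeled in a few lines. This is a toy in-memory sketch, not the PDS implementation: the `ToyBlobStore` class and its method names are hypothetical, and garbage collection of stale temporary blobs is left out.

```typescript
// Toy model of the blob lifecycle: upload -> temporary -> referenced.
// All names are hypothetical and exist only for illustration.
type BlobState = "temporary" | "referenced";

class ToyBlobStore {
  private states: Record<string, BlobState> = {};

  // Upload: the blob lands in temporary storage unless already known.
  upload(cid: string): void {
    if (this.states[cid] === undefined) {
      this.states[cid] = "temporary";
    }
  }

  // Record creation: every referenced blob must already be known
  // (temporary or referenced). On success all of them become public.
  createRecord(referencedCids: string[]): void {
    for (const cid of referencedCids) {
      if (this.states[cid] === undefined) {
        throw new Error("unknown blob: " + cid);
      }
    }
    for (const cid of referencedCids) {
      this.states[cid] = "referenced";
    }
  }

  // Temporary blobs are excluded from the listing.
  listBlobs(): string[] {
    return Object.keys(this.states).filter(
      (cid) => this.states[cid] === "referenced",
    );
  }

  // Temporary blobs are not available for download.
  canGet(cid: string): boolean {
    return this.states[cid] === "referenced";
  }
}
```

Note that nothing in this model looks at the blob's MIME type. Whether a type is acceptable is decided later, by whatever schema (if any) applies to the record that references the blob.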
However, applying validation does not mean that Bluesky’s restrictions will
necessarily be applied. A record that references a blob could very well be of a
type defined by a different lexicon, or, as we’ll see later on, part of a
sub-schema enabled by an open union in the Bluesky lexicon. Let’s see how this works in
practice.
In order to perform data creation operations against a PDS, an access token must
be acquired for authentication. The com.atproto.server.createSession XRPC method
can be used to exchange user credentials for a token. In the following curl
command, I used danielmangum.com as $BSKY_HANDLE and my password as $BSKY_PWD.
curl -X POST 'https://bsky.social/xrpc/com.atproto.server.createSession' \
  -H 'Content-Type: application/json' \
  -d '{"identifier": "'"$BSKY_HANDLE"'", "password": "'"$BSKY_PWD"'"}'
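For anyone scripting this step rather than using curl, here is a minimal sketch of the same request as data. The `buildCreateSessionRequest` helper and `SessionRequest` type are hypothetical names. The one practical difference from the shell version is that the body goes through `JSON.stringify`, so a password containing quotes or backslashes still produces valid JSON.

```typescript
// Hypothetical helper: describes the createSession request as plain data
// so it can be handed to whatever HTTP client is available.
interface SessionRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildCreateSessionRequest(
  identifier: string,
  password: string,
): SessionRequest {
  return {
    url: "https://bsky.social/xrpc/com.atproto.server.createSession",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // JSON.stringify escapes special characters in the credentials.
    body: JSON.stringify({ identifier: identifier, password: password }),
  };
}
```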
The response includes an accessJwt field, which will be used as $ACCESS_JWT in
subsequent operations. As described in the blob specification, a blob must be
uploaded prior to it being referenced. I wanted to verify that the blob was not
present in the com.atproto.sync.listBlobs output, or accessible via the
com.atproto.sync.getBlob method, immediately after upload, so I first checked
how many blobs were currently being returned.
curl -s 'https://bsky.social/xrpc/com.atproto.sync.listBlobs?did='"$DID"'' \
  -H 'Authorization: Bearer '"$ACCESS_JWT"'' | jq -r '.cids | length'
The decentralized identifier ($DID) used above can be obtained from the
createSession output as well. It is the underlying identifier for an account.
Every Bluesky handle resolves to a DID.
The com.atproto.repo.uploadBlob method is used to upload a blob to a repository.
The content of the website is a simple index.html file.
<h1>This Website is Hosted on Bluesky</h1>
<p>
This website is just a blob uploaded to Bluesky via the API. Curious about how
this works? Check out the write-up on <a
href="https://danielmangum.com/posts/this-website-is-hosted-on-bluesky/">danielmangum.com</a>.
</p>
To upload it, I used the following command.
curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.uploadBlob' \
  -H 'Authorization: Bearer '"$ACCESS_JWT"'' \
  -H 'Content-Type: text/html' \
  --data-binary '@index.html'
{
"blob": {
"$type": "blob",
"ref": {
"$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
},
"mimeType": "text/html",
"size": 268
}
}
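The $link value in a response like this one is a CID string, and its prefix is not arbitrary. As a rough sketch (the function names are hypothetical, and it assumes each header field fits in a single byte, which holds for this CID), decoding the base32 text shows the version, the content codec, and the hash function that produced the digest.

```typescript
// Lowercase RFC 4648 base32 alphabet, as used by "b"-prefixed multibase strings.
const BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";

// Decode a base32 multibase CID string into its raw bytes.
function decodeCidBytes(cid: string): Uint8Array {
  if (cid.charAt(0) !== "b") {
    throw new Error("expected a base32 multibase string starting with 'b'");
  }
  const out: number[] = [];
  let buffer = 0; // bit accumulator
  let bits = 0; // number of unread bits held in the accumulator
  for (let i = 1; i < cid.length; i++) {
    const value = BASE32_ALPHABET.indexOf(cid.charAt(i));
    if (value < 0) {
      throw new Error("invalid base32 character: " + cid.charAt(i));
    }
    buffer = (buffer << 5) | value;
    bits += 5;
    if (bits >= 8) {
      bits -= 8;
      out.push((buffer >> bits) & 0xff);
      buffer &= (1 << bits) - 1; // keep only the unread low bits
    }
  }
  return new Uint8Array(out);
}

// Read the four header bytes that precede the digest.
// Assumption: every field is a single-byte varint (true for values < 0x80).
function describeCid(cid: string) {
  const bytes = decodeCidBytes(cid);
  return {
    version: bytes[0], // 1 for CIDv1
    codec: bytes[1], // 0x55 is the "raw" multicodec
    hashFunction: bytes[2], // 0x12 is sha2-256
    digestLength: bytes[3], // 0x20 is 32 bytes
    totalBytes: bytes.length,
  };
}
```

For the CID returned above, this yields a version 1 CID with the raw codec and a 32-byte SHA-256 digest: the identifier is derived from the bytes of index.html rather than assigned by the server.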
The returned $link can be used as the Content Identifier (cid) when fetching the
blob via the getBlob method. However, according to the specification, because
this blob has not yet been referenced, it shouldn’t be accessible. I checked to
see if I could access it with the following command.
curl -L 'https://bsky.social/xrpc/com.atproto.sync.getBlob?did='"$DID"'&cid='"$LINK"''
{
"error": "InternalServerError",
"message": "Internal Server Error"
}
Not the error I was expecting, but it looks like I indeed cannot access it. I
was also able to determine that it had not been added to the listBlobs output.
curl -s 'https://bsky.social/xrpc/com.atproto.sync.listBlobs?did='"$DID"'' \
  -H 'Authorization: Bearer '"$ACCESS_JWT"'' | jq -r '.cids | length'
Blobs can be referenced in app.bsky.feed.post records on Bluesky by including an
embedded image. However, the app.bsky.embed.images schema restricts the MIME
type to those matching image/*. We can see this validation in action if we try
to create a post that embeds an HTML blob as an image.
curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.createRecord' \
  -H 'Authorization: Bearer '"$ACCESS_JWT"'' \
  -H 'Content-Type: application/json' \
-d '{
"repo": "danielmangum.com",
"collection": "app.bsky.feed.post",
"record": {
"$type": "app.bsky.feed.post",
"text": "testing123",
"createdAt": "2024-11-23T05:49:35.422015Z",
"embed": {
"$type": "app.bsky.embed.images",
"images": [
{
"alt": "that is not an image that is a website!",
"image": {
"$type": "blob",
"ref": {
"$link": "bafkreidphtuvbzublyzacxukmmk2ikiur5ahme75fegokbhh26o4wfzvry"
},
"mimeType": "text/html",
"size": 21
}
}
]
}
}
}'
{
"error": "InvalidMimeType",
"message": "Wrong type of file. It is text/html but it must match image/*."
}
For completeness, I also tried specifying the mimeType as image/jpeg and
verified that the PDS also validates that the blob reference MIME type matches
the blob.
{
"error": "InvalidMimeType",
"message": "Referenced Mimetype does not match stored blob. Expected: text/html, Got: image/jpeg"
}
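The two errors correspond to two separate checks. A minimal sketch of both follows, with hypothetical function names and an assumed order of checks; the actual PDS code may structure this differently.

```typescript
// Hypothetical helper: does a MIME type satisfy one accept pattern
// such as "image/*", "*/*", or an exact type?
function matchesAccept(mimeType: string, pattern: string): boolean {
  if (pattern === "*/*") {
    return true;
  }
  if (pattern.slice(-2) === "/*") {
    const prefix = pattern.slice(0, -1); // "image/*" -> "image/"
    return mimeType.slice(0, prefix.length) === prefix;
  }
  return mimeType === pattern;
}

// Check 1: the declared type must match the schema's accept list.
// Check 2: the declared type must equal the type stored at upload time.
// Returns an error description, or null when both checks pass.
function checkBlobRef(
  declared: string,
  stored: string,
  accept: string[],
): string | null {
  let accepted = false;
  for (const pattern of accept) {
    if (matchesAccept(declared, pattern)) {
      accepted = true;
    }
  }
  if (!accepted) {
    return "declared type " + declared + " is not accepted by the schema";
  }
  if (declared !== stored) {
    return "declared type " + declared + " does not match stored type " + stored;
  }
  return null;
}
```

Declaring text/html fails the first check, and declaring image/jpeg for a stored text/html blob fails the second, which mirrors the two responses above. Both checks only exist because the record's schema names an accept list in the first place.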
However, the blob $type is part of the AT Protocol data model and not specific
to Bluesky. Because Bluesky’s PDS implementation is open source, we can see
exactly how a BlobRef is defined.
export class BlobRef {
  public original: JsonBlobRef

  constructor(
    public ref: CID,
    public mimeType: string,
    public size: number,
    original?: JsonBlobRef,
  ) {
    this.original = original ?? {
      $type: 'blob',
      ref,
      mimeType,
      size,
    }
  }
  // remaining members omitted from this excerpt
}
We can also see exactly how the PDS identifies blobs in a record.
export const findBlobRefs = (
  val: LexValue,
  path: string[] = [],
  layer = 0,
): FoundBlobRef[] => {
  if (layer > 32) {
    return []
  }
  // walk arrays
  if (Array.isArray(val)) {
    return val.flatMap((item) => findBlobRefs(item, path, layer + 1))
  }
  // objects
  if (val && typeof val === 'object') {
    // convert blobs, leaving the original encoding so that we don't change CIDs on re-encode
    if (val instanceof BlobRef) {
      return [
        {
          ref: val,
          path,
        },
      ]
    }
    // retain cids & bytes
    if (CID.asCID(val) || val instanceof Uint8Array) {
      return []
    }
    return Object.entries(val).flatMap(([key, item]) =>
      findBlobRefs(item, [...path, key], layer + 1),
    )
  }
  // pass through
  return []
}
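To experiment with this walk outside the PDS codebase, here is a simplified stand-alone variant. It is a sketch under stated assumptions: `findJsonBlobs` and `FoundJsonBlob` are hypothetical names, and it inspects the plain JSON shape (`$type: "blob"` with a `ref.$link` string) instead of checking `instanceof BlobRef` on already-parsed values.

```typescript
// Simplified variant of the blob walk for plain JSON values.
// Assumption: blobs are recognized by their JSON shape rather than by class.
interface FoundJsonBlob {
  cid: string;
  mimeType: string;
  path: string[];
}

function findJsonBlobs(
  val: unknown,
  path: string[] = [],
  layer = 0,
): FoundJsonBlob[] {
  // Same depth cap as the original, plus a guard for non-objects.
  if (layer > 32 || val === null || typeof val !== "object") {
    return [];
  }
  const found: FoundJsonBlob[] = [];
  // Arrays: visit each item; like the original, the index is not added to the path.
  if (Array.isArray(val)) {
    for (const item of val) {
      for (const f of findJsonBlobs(item, path, layer + 1)) {
        found.push(f);
      }
    }
    return found;
  }
  const obj = val as Record<string, unknown>;
  const ref = obj["ref"] as Record<string, unknown> | undefined;
  // A blob reference is identified purely by its own fields.
  if (
    obj["$type"] === "blob" &&
    ref &&
    typeof ref === "object" &&
    typeof ref["$link"] === "string"
  ) {
    return [
      {
        cid: ref["$link"] as string,
        mimeType: String(obj["mimeType"]),
        path: path,
      },
    ];
  }
  // Any other object: recurse into every key, whatever it is called.
  for (const key of Object.keys(obj)) {
    for (const f of findJsonBlobs(obj[key], path.concat([key]), layer + 1)) {
      found.push(f);
    }
  }
  return found;
}
```

The walk never consults a schema. A blob nested under an arbitrary key of an arbitrary record type is found the same way as one inside a Bluesky image embed.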
The important thing to notice is that identifying blob references does not
require the presence of a lexicon schema. findBlobRefs recursively navigates a
LexValue and looks for $type: blob. In order to support new lexicons over
time, the PDS needs to be able to handle lexicons that it doesn’t know about.
Because blobs are a fundamental component of so many applications, these new
lexicons also need to be able to leverage them. To put this into action, I
attempted to create a record of type com.danielmangum.hack.website, which
included a reference to the uploaded HTML blob.
curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.createRecord' \
  -H 'Authorization: Bearer '"$ACCESS_JWT"'' \
  -H 'Content-Type: application/json' \
-d '{
"repo": "danielmangum.com",
"collection": "com.danielmangum.hack.website",
"record": {
"$type": "com.danielmangum.hack.website",
"website": {
"$type": "blob",
"ref": {
"$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
},
"mimeType": "text/html",
"size": 268
}
}
}'
{
"uri": "at://did:plc:j22nebhg6aek3kt2mex5ng7e/com.danielmangum.hack.website/3lbnguuzckm2u",
"cid": "bafyreicjcptshc7lmgb7abxlvcb5fmqqjdj6neie23szyum7rcaowmm5qm",
"commit": {
"cid": "bafyreid6apjjy56xoyenxmg5xv356twh22n3hayecoxlf6mflpltlzpuwu",
"rev": "3lbnguuzmd42u"
},
"validationStatus": "unknown"
}
It worked! We can see in the response that the PDS was unable to validate the
record (validationStatus: unknown) because it does not know about the
com.danielmangum.hack.* lexicon. Nevertheless, it will agree to persist the
record. The next step was to check whether the referenced blob had been
persisted.
curl -s 'https://bsky.social/xrpc/com.atproto.sync.listBlobs?did='"$DID"'' \
  -H 'Authorization: Bearer '"$ACCESS_JWT"'' | jq -r '.cids | length'
It looked like it had, as the count had increased by 1. Fetching the blob
directly would tell us for sure. Importantly, getBlob does not require passing
the $ACCESS_JWT because unauthenticated parties need to be able to
fetch blobs to process alongside records that reference them.
curl 'https://bsky.social/xrpc/com.atproto.sync.getBlob?did='"$DID"'&cid=bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq'
{
"error": "Redirecting",
"message": "Redirecting to new blob location"
}
Adding -L to the command enables following redirects.
curl -L 'https://bsky.social/xrpc/com.atproto.sync.getBlob?did='"$DID"'&cid=bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq'
<h1>This Website is Hosted on Bluesky</h1>
<p>
This website is just a blob uploaded to Bluesky via the API. Curious about how
this works? Check out the write-up on <a
href="https://danielmangum.com/posts/this-website-is-hosted-on-bluesky/">danielmangum.com</a>.
</p>
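Since getBlob takes only a DID and a CID, the URL for any referenced blob can be assembled mechanically. A small sketch, with a hypothetical `getBlobUrl` helper; it percent-encodes the query values, so the colons in the DID appear as %3A, which is an equivalent spelling of the same query.

```typescript
// Hypothetical helper: builds an unauthenticated getBlob URL for a host,
// which may be the bsky.social entryway or an individual PDS instance.
function getBlobUrl(host: string, did: string, cid: string): string {
  return (
    "https://" +
    host +
    "/xrpc/com.atproto.sync.getBlob" +
    "?did=" +
    encodeURIComponent(did) +
    "&cid=" +
    encodeURIComponent(cid)
  );
}
```

Called with bsky.social as the host, the result is the same request made by the curl command above. Called with the host name of a PDS instance, it addresses that instance directly.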
Examining the redirect response, we can see that we are being redirected to my
PDS.
< HTTP/2 302
< date: Sun, 24 Nov 2024 13:58:21 GMT
< content-type: application/json; charset=utf-8
< content-length: 68
< location: https://porcini.us-east.host.bsky.network/xrpc/com.atproto.sync.getBlob?did=did:plc:j22nebhg6aek3kt2mex5ng7e&cid=bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq
< x-powered-by: Express
< access-control-allow-origin: *
< ratelimit-limit: 3000
< ratelimit-remaining: 2997
< ratelimit-reset: 1732456898
< ratelimit-policy: 3000;w=300
< etag: W/"44-1je7JKzDJZFd5iRtOI+IS+zlOOE"
< vary: Accept-Encoding
Opening the location URL in the browser presents the website as expected. And
just like that, we have a website hosted on Bluesky! While this is not really
the intended use of blobs on Bluesky specifically, it could be a legitimate use
case in the future. Records that reference website content, code, or other
binary artifacts are a possibility on the AT Protocol. That being said, if a
service like Bluesky is running PDS instances on behalf of users, this
effectively equates to free (albeit unreliable) arbitrary file hosting, which
has implications beyond just racking up large storage and egress data fees.
Returning to the blobs specification, there is an additional section on
Security Considerations.
Serving arbitrary user-uploaded files from a web server raises many content
security issues. For example, cross-site scripting (XSS) of scripts or SVG
content from the same “origin” as other web pages. It is effectively mandatory
to enable a Content Security Policy (LINK:
https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP) for the getBlob
endpoint. It is effectively not supported to dynamically serve assets directly
out of blob storage (the getBlob endpoint) directly to browsers and web
applications. Applications must proxy blobs, files, and assets through an
independent CDN, proxy, or other web service before serving to browsers and
web agents, and such services are expected to implement security precautions.
Bluesky does apply recommended CSP headers to the endpoint in the handler,
which guards against some of the issues described.
res.setHeader('x-content-type-options', 'nosniff')
res.setHeader('content-security-policy', `default-src 'none'; sandbox`)
There is also a default size limit on blobs of 5 MB.
blobUploadLimit: env.blobUploadLimit ?? 5 * 1024 * 1024, // 5mb
Images, the most common blob type on the Bluesky application, are expectedly not
served directly from PDS instances, but from the Bluesky CDN. For example, the
following URL points to the feed thumbnail version of an image I recently
uploaded as part of a post.
https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:j22nebhg6aek3kt2mex5ng7e/bafkreie5ci75iujpv34slnh3o4b7xcuxklpcguzed6qqmi4eaagn6cg4ve@jpeg
A different URL provides the full size version.
https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:j22nebhg6aek3kt2mex5ng7e/bafkreie5ci75iujpv34slnh3o4b7xcuxklpcguzed6qqmi4eaagn6cg4ve@jpeg
However, the post that references the image just includes the cid. The
application itself needs to be aware of how images are served from the CDN.
{
"$type": "app.bsky.feed.post",
"createdAt": "2024-11-12T14:18:44.263Z",
"embed": {
"$type": "app.bsky.embed.images",
"images": [
{
"alt": "Title image for blog post \"USB On-The-Go on the ESP32-S3\" on danielmangum.com.",
"aspectRatio": {
"height": 1080,
"width": 1920
},
"image": {
"$type": "blob",
"ref": {
"$link": "bafkreie5ci75iujpv34slnh3o4b7xcuxklpcguzed6qqmi4eaagn6cg4ve"
},
"mimeType": "image/jpeg",
"size": 869901
}
}
]
},
"facets": [
{
"features": [
{
"$type": "app.bsky.richtext.facet#link",
"uri": "https://danielmangum.com/posts/usb-otg-esp32s3/"
}
],
"index": {
"byteEnd": 261,
"byteStart": 229
}
}
],
"langs": [
"en"
],
"text": "ICYMI: This weekend I wrote about USB On-The-Go on the ESP32-S3. OTG allows devices to also act as USB hosts. I dive into how the USB PHY is configured, and demonstrate connecting two ESP32-S3's, as well as a Raspberry Pi Pico.\n\ndanielmangum.com/posts/usb-ot..."
}
The logic is present in the ImageUriBuilder, which will use a CDN if one is
configured.
const imgUriBuilder = new ImageUriBuilder(
config.cdnUrl || `${config.publicUrl}/img`,
)
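Judging only from the two example URLs above, the CDN path appears to be built from a preset name, the DID, and the cid. The sketch below reconstructs that pattern. It is not the ImageUriBuilder implementation: `cdnImageUrl` is a hypothetical helper, and the fixed "plain" segment and "@jpeg" suffix are copied from the examples rather than taken from the source code.

```typescript
// Presets observed in the example URLs above.
type ImagePreset = "feed_thumbnail" | "feed_fullsize";

// Hypothetical helper: rebuilds a CDN image URL from record data.
// Assumption: the URL layout matches the two examples in this post.
function cdnImageUrl(
  cdnBase: string,
  preset: ImagePreset,
  did: string,
  cid: string,
): string {
  return cdnBase + "/" + preset + "/plain/" + did + "/" + cid + "@jpeg";
}
```

With https://cdn.bsky.app/img as the base, the DID of the account, and the $link from the post record, this reproduces both URLs shown earlier. Everything beyond the cid is application knowledge that lives outside the record.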
So why does Bluesky provide direct unauthenticated access to the PDS getBlob
endpoint? Once again illustrating the beauty of open source, there is an issue
describing the original motivation. In it, image labeling and user content
export, as well as additional future use cases, are enumerated. There is also a
mention of the possibility of users hotlinking content and using Bluesky for
free hosting, so these issues are clearly top-of-mind. The original
implementation did not include the proper security headers, but they were
subsequently added.
Traditional social platforms can place more restrictions on blobs at time of
upload because there is a limited set of valid content. The extensibility of
Bluesky and the AT Protocol, which is what differentiates it from traditional
networks, also necessitates more complexity. However, I, and evidently the
awesome folks building Bluesky, think it’s worth it.
Bonus Content
I mentioned sub-schemas and open unions earlier in this post. The
app.bsky.feed.post type includes a union for valid embeds. Per the AT Protocol
lexicon specification, unions are open unless explicitly marked as closed.
By default unions are “open”, meaning that future revisions of the schema
could add more types to the list of refs (though can not remove types). This
means that implementations should be permissive when validating, in case they
do not have the most recent version of the Lexicon. The closed flag
(boolean) can indicate that the set of types is fixed and can not be extended
in the future.
The embed union is not marked as closed.
"embed": {
"type": "union",
"refs": [
"app.bsky.embed.images",
"app.bsky.embed.video",
"app.bsky.embed.external",
"app.bsky.embed.record",
"app.bsky.embed.recordWithMedia"
]
},
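The practical difference between an open and a closed union can be sketched in a few lines. The `UnionSchema` shape and `checkUnionType` function are hypothetical simplifications for illustration, not the validator used by the PDS.

```typescript
// Simplified view of a lexicon union definition.
interface UnionSchema {
  refs: string[];
  closed?: boolean; // absent or false means the union is open
}

type UnionCheck = "known" | "unknown-allowed" | "rejected";

// Decide how a validator might treat a value's $type against a union.
function checkUnionType(schema: UnionSchema, type: string): UnionCheck {
  if (schema.refs.indexOf(type) !== -1) {
    return "known"; // validate against the referenced schema
  }
  // Open unions tolerate unlisted types; closed unions reject them.
  return schema.closed === true ? "rejected" : "unknown-allowed";
}

// The embed union from app.bsky.feed.post, as shown above.
const embedUnion: UnionSchema = {
  refs: [
    "app.bsky.embed.images",
    "app.bsky.embed.video",
    "app.bsky.embed.external",
    "app.bsky.embed.record",
    "app.bsky.embed.recordWithMedia",
  ],
};
```

Under this model an unlisted $type such as com.danielmangum.hack.sites falls into the "unknown-allowed" case for the embed union, and would only be rejected if the union carried closed: true.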
Therefore, posts can be created with an embed $type that is not enumerated.
For example, I could also persist the website HTML by making a post on
Bluesky with a custom embed.
curl -X POST 'https://bsky.social/xrpc/com.atproto.repo.createRecord' \
  -H 'Authorization: Bearer '"$ACCESS_JWT"'' \
  -H 'Content-Type: application/json' \
-d '{
"repo": "danielmangum.com",
"collection": "app.bsky.feed.post",
"record": {
"$type": "app.bsky.feed.post",
"text": "This post embeds a website.",
"createdAt": "2024-11-23T05:49:35.422015Z",
"embed": {
"$type": "com.danielmangum.hack.sites",
"sites": [
{
"site": {
"$type": "blob",
"ref": {
"$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
},
"mimeType": "text/html",
"size": 268
}
}
]
}
}
}'
{
"uri": "at://did:plc:j22nebhg6aek3kt2mex5ng7e/app.bsky.feed.post/3lbpfxwnjoq23",
"cid": "bafyreidnlyhcvlzl5hc3btih6ly5anjld6ss4bgocyichnm72cpnjuzsvu",
"commit": {
"cid": "bafyreibv77m3bdyywmotn7ncbbrqv6pv7irzmw27bzklt4tppgsoodarma",
"rev": "3lbpfxwo57q23"
},
"validationStatus": "valid"
}
In the Bluesky application, the embed is silently ignored.
However, the content is persisted and the reference is included in the post
record, so a different application could choose to start rendering the embed.
{
"$type": "app.bsky.feed.post",
"createdAt": "2024-11-23T05:49:35.422015Z",
"embed": {
"$type": "com.danielmangum.hack.sites",
"sites": [
{
"site": {
"$type": "blob",
"ref": {
"$link": "bafkreic5fmelmhqoqxfjz2siw5ey43ixwlzg5gvv2pkkz7o25ikepv4zeq"
},
"mimeType": "text/html",
"size": 268
}
}
]
},
"text": "This post embeds a website."
}
In my opinion, this is one of the most interesting features of lexicons because
it allows for “micro-extensions” that build on existing use cases (e.g.
“microblogging”). For example, I for one would love a world in which small code
snippets could be embedded in my posts and run in a WebAssembly sandbox by other
users. But that’s a post for another day.