Discussion:
Large(ish) scale PDF file caching
George Wilson
2014-03-04 18:09:09 UTC
Permalink
Greetings all,
I hope this is not tiptoeing off topic, but I am working on solving a
problem at work right now and was hoping someone here might have some
experience/insight.

My company has a new proprietary server which generates PDF chemical safety
documents via a REST API and returns them to the user. My project manager
wants a layer of separation between the website (and hence the user) and the
document server, so I wrote an intermediary script which accepts a request
from the website and attempts to grab a PDF from the document server via
PHP's cURL extension. That appears to be working well.
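
Roughly, the fetch side looks something like this (the endpoint URL, the id
parameter, and the fetch_sds_pdf() helper are placeholders for illustration,
not our real internal API):

<?php
// Sketch of the intermediary fetch, assuming a hypothetical REST endpoint
// on the internal document server.
function fetch_sds_pdf($chemicalId)
{
    $url = 'http://docserver.internal/api/sds/' . urlencode($chemicalId);

    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,  // return the body rather than printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 60,    // generation can take a few seconds
    ));
    $pdf  = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return ($pdf !== false && $code === 200) ? $pdf : false;
}

// Stream the result straight back to the browser.
$pdf = fetch_sds_pdf(isset($_GET['id']) ? $_GET['id'] : '');
if ($pdf === false) {
    header('HTTP/1.1 502 Bad Gateway');
    exit('Document server did not return a PDF.');
}
header('Content-Type: application/pdf');
header('Content-Length: ' . strlen($pdf));
echo $pdf;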

Here is the issue I am trying to solve:

We must assume a total of 1.4 million possible documents which may be
generated by this system, each initiated directly from our website. Each
document is estimated to be about a megabyte in size, and generating each
one takes at least a few seconds.

We are interested in setting up some kind of document caching system
(either a home-brewed PHP-based system or one that generates the files,
saves them, and periodically deletes them). My project manager is
concerned about web crawlers kicking off the generation of these files,
so we are considering strategies to avoid blowing out our server resources.
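
The sort of thing we have in mind is a quick filesystem check in front of
that fetch, along these lines (the cache path is arbitrary and
fetch_sds_pdf() is the placeholder helper sketched above):

<?php
// Serve from the filesystem cache when possible; otherwise ask the
// document server to generate the PDF and keep a copy for next time.
$cacheDir  = '/var/cache/sds';                         // assumed cache location
$id        = isset($_GET['id']) ? $_GET['id'] : '';
$safeId    = preg_replace('/[^A-Za-z0-9_-]/', '', $id); // never trust the raw request
$cacheFile = $cacheDir . '/' . $safeId . '.pdf';

if ($safeId === '') {
    header('HTTP/1.1 400 Bad Request');
    exit;
}

if (!is_file($cacheFile)) {
    $pdf = fetch_sds_pdf($safeId);                     // cURL call to the document server
    if ($pdf === false) {
        header('HTTP/1.1 502 Bad Gateway');
        exit;
    }
    file_put_contents($cacheFile, $pdf, LOCK_EX);
}

header('Content-Type: application/pdf');
header('Content-Length: ' . filesize($cacheFile));
readfile($cacheFile);                                  // cache hits never touch the document server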

Does anyone have any suggestions or have you dealt with this problem before?

Thank you in advance
Bastien Koert
2014-03-04 19:01:25 UTC
Permalink
Use the filesystem: store each PDF on disk as it is generated.

You can use a cron job to delete the files older than x days.

Is each PDF substantially different, or does it differ only in some
pertinent details like customer name, address, etc.? Could you template
that so that you are generating a minimal number of PDFs?

How often does a customer come in and request a new PDF?
--
Bastien

Cat, the other other white meat
George Wilson
2014-03-04 19:46:53 UTC
Permalink
Post by Bastien Koert
Use the filesystem: store each PDF on disk as it is generated.
You can use a cron job to delete the files older than x days.
Thanks for the suggestion; that sounds like it might be the simplest approach.
Post by Bastien Koert
Is each PDF substantially different, or does it differ only in some
pertinent details like customer name, address, etc.? Could you template
that so that you are generating a minimal number of PDFs?
The generation is handled by a third-party, black-box application. Each
document is substantially different from the others: they are chemical
safety data sheets (SDS) and hence are specific to the particular chemicals
they represent. (If you are curious/interested, this page from Dow Chemical
has a brief explanation:
http://www.dow.com/productsafety/safety/sds.htm)
Post by Bastien Koert
How often does a customer come in and request a new PDF?
This is a new system, so it is hard to say. For some chemicals we might
anticipate requests several times a week (perhaps several times a day);
others very rarely, perhaps as little as once ever. One thing we had
considered is creating an SDS hit tracker which could scale the relative
importance of a particular file; then, when the cron job comes around, it
could take that into consideration. I am not really sure we would see a
substantial benefit over a simple find-based cron job. Customers are not
the PM's concern; it is the web spiders.
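
Very roughly, the hit tracker could be as simple as a sidecar counter next
to each cached file, something like this (names and paths are only
illustrative):

<?php
// Record a hit against a cached SDS so a cleanup job could weight
// frequently requested sheets more heavily than rarely requested ones.
// Not atomic, but good enough for rough relative importance.
function record_sds_hit($cacheFile)
{
    $counterFile = $cacheFile . '.hits';
    $hits = is_file($counterFile) ? (int) file_get_contents($counterFile) : 0;
    file_put_contents($counterFile, $hits + 1, LOCK_EX);
}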
Bastien Koert
2014-03-04 21:56:07 UTC
Permalink
OK, so what I proposed should work fairly well. To keep spiders off, add a
robots.txt file to the webserver to block them, though hopefully this
process is hidden behind a password or session to prevent unnecessary
generation anyway. This note from Google might help:
https://support.google.com/webmasters/answer/156449?hl=en
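
For reference, the robots.txt entry could be as simple as this, assuming
the PDFs are all served from one path (the /sds/ prefix is just an example):

User-agent: *
Disallow: /sds/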
--
Bastien

Cat, the other other white meat
George Wilson
2014-03-05 00:49:16 UTC
Permalink
Post by Bastien Koert
OK, so what I proposed should work fairly well. To keep spiders off, add a
robots.txt file to the webserver to block them, though hopefully this
process is hidden behind a password or session to prevent unnecessary
generation anyway. This note from Google might help:
https://support.google.com/webmasters/answer/156449?hl=en
I had suggested using a robots.txt file. I am not certain why, but the
project manager wants the robots to be able to crawl the PDF links. Thank
you for the reference; I will check it out!
Jan Ehrhardt
2014-03-04 22:04:03 UTC
Permalink
Post by George Wilson
Post by Bastien Koert
Use the filesystem: store each PDF on disk as it is generated.
You can use a cron job to delete the files older than x days.
Thanks for the suggestion; that sounds like it might be the simplest approach.
I have got more or less the same problem: fewer files (20,000), but larger
ones (200 MB on average). Far too much to store on the webserver itself.
They are video files, but that does not make the problem any different.

When a user requests a file that is not on the production server, a
script fetches it from the archive (with 5 TB of storage). That takes half
a minute; in your case it would be a few seconds.

The cron job that deletes files should look not at the creation time, nor
at the last-modified time, but rather at the time of last access (ls -lu
or PHP's fileatime). That way you will not delete files that are requested
very frequently, only those that are requested once in a while.
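
As a rough sketch, that cleanup could be a small PHP script run from cron,
something like the following (directory and cutoff are only examples). One
caveat: on filesystems mounted with noatime the access time is never
updated, in which case you would have to fall back to filemtime (and touch
the file yourself on every cache hit).

<?php
// Cron-driven cache cleanup: delete PDFs that have not been *accessed*
// in the last 30 days, leaving frequently requested ones alone.
$cacheDir = '/var/cache/sds';   // assumed cache location
$maxAge   = 30 * 24 * 3600;     // 30 days; pick whatever fits your disk budget
$now      = time();

$files = glob($cacheDir . '/*.pdf');
if ($files === false) {
    exit("Could not read $cacheDir\n");
}

foreach ($files as $file) {
    $lastAccess = fileatime($file);
    if ($lastAccess !== false && ($now - $lastAccess) > $maxAge) {
        unlink($file);          // nobody has asked for this sheet lately
    }
}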

Jan
George Wilson
2014-03-05 00:54:10 UTC
Permalink
Post by Jan Ehrhardt
The cron job that deletes files should look not at the creation time, nor
at the last-modified time, but rather at the time of last access (ls -lu
or PHP's fileatime).
Ah, I had not thought of that. I suppose the simplest solution is often
best. I highly doubt these documents will change considerably over time,
so I would assume a longer cache expiration will not be a problem. Even
so, we could always have two cron jobs, one of which periodically removes
files with a creation date older than a month or so.

Thank you both; these suggestions have been very helpful.