
I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to set up for optimal performance. My project currently processes WARC files which are gzipped.

With the current input format, each file is sent to a single mapper and is not split. I understand this is the expected behavior for a compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, so that the input can be split and thus use more mappers? Would that be possible? Does having more mappers create more overhead in latency, or is it better to have one mapper? Thanks for your help.

Basically it depends on where you are running it. If you are running it on a single machine, then I don't think there will be much performance improvement. But if you are running it in a distributed environment, then yes, there will be. You can split your file and send the pieces to multiple mappers, which in turn run simultaneously on other machines, so you get the answer back faster. Suppose the program runs for 10 hours on a single machine. If you have 10 machines and map the work across those 10 machines, you can have your results after 1 hour of parallel execution.
Shreyas Sarvothama Oct 30, 2016 at 5:27

Thank you for your response. I am using the Amazon Elastic MapReduce service for processing. With the current configuration I am only taking advantage of one mapper, which means the other nodes are sitting idle, which seems like a waste to me. Ideally I would want the file to be split across multiple mappers to take advantage of all the nodes I have provisioned. I think you have answered my question about whether I should decompress the file to local storage first so that it can be split across multiple mappers by Hadoop.
user1738628 Oct 30, 2016 at 19:57

Remember that there are tens of thousands of WARC files for each crawl, so splitting individual files probably isn't important (at least it wasn't important for my use case). FYI, I put up some sample code on GitHub for reading WARC files in Hadoop/Spark: github.com/banshee/ccsparkWarc . Move your heavy parsing to a step after you read in the WARC files, and everything will be distributed normally.
James Moore Oct 20, 2020 at 17:38

Although WARC files are gzipped, they are splittable (cf. Best splittable compression for Hadoop input = bz2?) because every record is compressed as its own gzip member with its own deflate block. But the record offsets must be known in advance.
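To make the "offsets must be known in advance" point concrete: if a record's byte offset is available from some external index, a reader can seek straight to that record's gzip member and start decompressing there. The following is only a minimal sketch of that idea, not a complete InputFormat; the class name and the existence of such an offset index are assumptions.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WarcRecordAtOffset {
        // Opens a decompressing stream positioned at one WARC record inside a .warc.gz file.
        // recordOffset is the byte offset of that record's gzip member and has to come
        // from an external index (assumption: such an index is already available).
        public static InputStream openRecordAt(Configuration conf, Path warcFile, long recordOffset)
                throws IOException {
            FileSystem fs = warcFile.getFileSystem(conf);
            FSDataInputStream raw = fs.open(warcFile);
            raw.seek(recordOffset);              // jump straight to the record's gzip member
            // GZIPInputStream starts decoding at the current position; a real record reader
            // would also stop at the end of this record instead of reading on to the file end.
            return new GZIPInputStream(raw);
        }
    }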

But is this really necessary? The Common Crawl WARC files are all about 1 GB in size, and one should normally be processed within 15 minutes at most. Given the overhead of launching a map task, that is a reasonable running time for a mapper. Possibly a mapper could also process a few WARC files, but it's important that you have enough splits of the input WARC file list so that all nodes are running tasks. Processing a single WARC file on Hadoop would mean a lot of unnecessary overhead.
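For context, one common way to get "enough splits of the input WARC file list" is to feed the job a plain text file of WARC paths through NLineInputFormat, so each map task receives one (or a few) file names and opens the WARC itself. The sketch below assumes that setup; the class names WarcListJob and WarcPathMapper are made up for the example, and the mapper body is only a stub where the real record parsing would go.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WarcListJob {

        // Each input line is the path of one WARC file; the real record parsing would
        // happen here. This stub only reports the file size so the job is runnable.
        public static class WarcPathMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Path warc = new Path(value.toString().trim());
                FileSystem fs = warc.getFileSystem(context.getConfiguration());
                long length = fs.getFileStatus(warc).getLen();
                context.write(new Text(warc.getName()), new LongWritable(length));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "warc-file-list");
            job.setJarByClass(WarcListJob.class);

            // args[0] is a text file listing one WARC path per line;
            // one line per split means one WARC file per map task.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);
            NLineInputFormat.addInputPath(job, new Path(args[0]));

            job.setMapperClass(WarcPathMapper.class);
            job.setNumReduceTasks(0);            // map-only for this sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With one line per split, a list of 100 WARC paths produces 100 map tasks spread across the cluster; raising the lines-per-split value batches several files into one task if individual files finish too quickly.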

Thanks Sebastian for responding. My mapper is doing heavy parsing work on each record contained within the gzipped WARC file. My initial tests took around 30 minutes to map and reduce one gzipped file. I have tested a producer/consumer approach locally, with one thread iterating through all of the records in the stream and placing them on a queue for consumer threads to parse the content bodies. If I could split the file so that more mappers run in parallel, I could potentially get the time for each WARC archive file down to a few minutes. Does this sound reasonable, or is it the wrong approach?
user1738628 Oct 30, 2016 at 23:21
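Regarding the producer/consumer test mentioned in that comment: a minimal local sketch of the pattern (not the commenter's actual code) could look like the following, with one thread decompressing the gzip stream and a pool of workers taking items off a BlockingQueue. The class name, the choice of one text line per queue item, and the queue size are all assumptions made for illustration.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.zip.GZIPInputStream;

    public class WarcQueueDemo {

        // Sentinel value that tells a consumer the producer is done.
        private static final String POISON = "\u0000EOF";

        public static void main(String[] args) throws Exception {
            String warcPath = args[0];          // path to a local .warc.gz file
            int workers = Runtime.getRuntime().availableProcessors();
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
            ExecutorService pool = Executors.newFixedThreadPool(workers);

            // Consumers: take records off the queue and do the heavy parsing.
            for (int i = 0; i < workers; i++) {
                pool.submit(() -> {
                    try {
                        for (String record = queue.take(); !record.equals(POISON); record = queue.take()) {
                            parse(record);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            // Producer: a single thread decompresses the stream and enqueues "records".
            // For brevity a record here is just one line of text; a real reader would
            // accumulate a full WARC record (headers plus content body) per queue item.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(warcPath)), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    queue.put(line);
                }
            }
            for (int i = 0; i < workers; i++) {
                queue.put(POISON);              // one sentinel per consumer thread
            }
            pool.shutdown();
        }

        private static void parse(String record) {
            // Placeholder for the actual content-body parsing.
        }
    }

This only parallelizes parsing within one machine; as the answer above notes, the bigger win usually comes from giving each mapper its own WARC file so the whole cluster stays busy.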
